Salk / CNL publications on ICA applied to EEG and fMRI data
0. What is ICA?
Independent Component Analysis (ICA) is a signal processing technique that
is currently receiving increasing attention. In its simplest form, the
problem is to separate N statistically independent inputs which have
been mixed linearly in N output channels, without further knowledge
about their distributions or dynamics. This is also called a problem of
blind separation. In 1995, Bell & Sejnowski published an elegant neural
network "Infomax" algorithm that can approximate the solution to this ICA
problem in many cases. Later that year, Makeig, Bell, Jung and Sejnowski
showed a first application of ICA to decomposition of spontaneous
multi-channel electroencephalographic (EEG) and averaged event-related
potential (ERP) data sets at the NIPS meeting in Denver
(Makeig et al., 1996).
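The blind separation problem, and the infomax approach to it, can be sketched in a few lines of code. The following is an illustrative numpy toy, not the toolbox code: two made-up super-Gaussian (Laplacian) sources are mixed linearly, the mixtures are sphered, and logistic infomax with the natural-gradient update recovers the sources up to order, sign, and scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_pts = 2, 5000

# Two statistically independent, super-Gaussian toy sources
S = rng.laplace(size=(n_src, n_pts))
A = np.array([[1.0, 0.5],   # unknown linear mixing matrix
              [0.4, 1.0]])
X = A @ S                    # observed "channel" mixtures

# Sphere (whiten) the zero-mean data
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n_pts)
sphere = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
Xw = sphere @ X

# Logistic infomax with the natural-gradient update:
#   W <- W + lrate * (I + (1 - 2*y) u^T / B) W,  y = logistic(u)
W, lrate, block = np.eye(n_src), 0.01, 100
for epoch in range(100):
    perm = rng.permutation(n_pts)
    for start in range(0, n_pts, block):
        u = W @ Xw[:, perm[start:start + block]]
        y = 1.0 / (1.0 + np.exp(-u))
        W = W + lrate * (np.eye(n_src) + (1.0 - 2.0 * y) @ u.T / u.shape[1]) @ W

U = W @ Xw   # recovered activations (up to order, sign, and scale)
```

Each row of U should correlate almost perfectly with one row of S, though which row, and with what sign and scale, is arbitrary.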
A toolbox of Matlab routines developed for
conveniently applying the Infomax ICA algorithm
to EEG or MEG (magnetic EEG) data and a set of measurement
and visualization functions for analyzing results of ICA decomposition
of averaged or single-trial ERPs and ERFs (event-related fields),
is available
here.
For simple Matlab ICA kernel code, mathematical details,
and alternate approaches, see the
CNL ICA web pages.
To read or download a Web tutorial on the ICA / EEG toolbox,
Click Here.
For information about changes in the latest version of the toolbox, see
ica.readme.html
The FAQs below are divided into theoretical and practical questions.
1. Is the assumption that EEG is a linear mixture of underlying brain
"sources" correct?
In the EEG frequency range, the mixing of brain fields
at the scalp electrodes is basically linear. The skull attenuates brain
potentials strongly, and "smears" (low-pass filters) them spatially, but
this does not affect the linear summation (and thus mixing)
of potentials from the brain (and non-brain artifacts)
at the scalp electrodes.
2. If the activations of the brain "sources" are partially overlapping,
what does the ICA algorithm do?
See the answer to the next question...
3. What are brain "sources" of event-related brain responses, anyway?
This is an important question, with seemingly contradictory answers!
First, remember:
ICA does not solve the inverse problem. (In general, nothing can!)
From the viewpoint of ICA, "brain sources" are NOT necessarily "fMRI hot spots"
NOR "single equivalent dipoles," but rather
"concurrent electromagnetic activity that is spatially fixed
and temporally independent of activity arising in other spatially fixed
sources and summed with it in the input data."
Networks producing such concurrent activity are defined
NOT by compact spatial distributions in the brain, but by the
covarying field measurements they produce at the scalp sensors. In
general, "sources" of ICA components may be (one or more) distributed
brain networks rather than "physically compact active brain regions."
These networks may be functionally linked, forming a (possibly transient)
larger network, or may simply be activated concurrently in the input
data, by chance as it were.
When (as often happens when decomposing spontaneous or event-related
single-trial EEG data)
an ICA component map turns out to be compatible with a possible
compact generator region in cortex, this is not because ICA
had this as an objective, but because a coherent signal source, independent
of other sources, projected to the electrodes or SQUID sensors in just
this pattern - a striking coincidence!
In short, ICA produces an independent component analysis
decomposition of the input data rather than an inverse source model
decomposition!
Understanding the difference between these two concepts may require
almost a paradigm shift in the way one thinks about the brain. Is the
brain a collection of physically discrete neural networks which pass
information to each other by (occasional) neural impulses, as we send
letters or email to each other? Or is the brain a dynamically shifting
collection of interpenetrating, distributed, and possibly transient
neural networks that communicate via some form(s) of mass action?
The first viewpoint is that of classical anatomy and physiology. The
second is that of a slowly emerging dynamic systems perspective on
neuroscience. ICA is a tool for discovering independent sources of
spatial dependencies in multichannel data. It asks, and answers, a
different problem than so-called "source localization." The two
questions are complementary, hence the answers they produce may be
complementary parts of "the whole story" of "how brains work."
In truth, brain networks are most probably never wholly autonomous.
They are neither physically wholly isolated from one another, nor do
they act wholly independently! Attempts to solve the inverse problem
(source localization) may assume the first (at some level). ICA assumes
the second. Might decomposition algorithms be derived which mediate
between these two extremes? Probably so, if some intermediate set of
mathematically exact and biologically reasonable assumptions
were developed! For now, ICA algorithms are opening new windows
into event-related brain dynamics.
4. So, if the activations of some "physically separate brain network"
are partially overlapping, what does the ICA algorithm do?
If relatively intense, correlated activity in two components during one
and the same time period is not strongly diluted by independent activity
(including baseline noise) at other times in their activations, ICA may
treat the period of correlated activity as reflecting the emergence of a
transient cooperative source integrating the synchronously active
networks. (Note: ICA deals with higher-order spatial dependencies as well
as correlations.) But what, you ask, if these networks "just happen" to be
co-active? How is ICA to know? It reports (as best it can) what independent
components make up the data (if the data allow such a decomposition).
The interpretation of the components is up to the user.
(For some 2-d plots showing the near-independence and near-maximum
entropy of ICA components, see an image of
pairs of ERP-data activations (after logistic compression)
plotted against one another (67k)).
5. How do I know if an ICA component originates in one or more than one
physical region in the brain?
By itself, ICA provides no spatial information beyond the scalp topographies
of the independent components. One may attempt to perform source
localization on these maps. In general, fewer active brain networks
may dominate each component map than in the raw scalp data. Thus, ICA
may be useful as a preprocessing step prior to attempting source
localization (using whatever relevant additional assumptions).
6. What is the difference between ICA and PCA?
(A very "FAQ").
PCA (principal component analysis), usually implemented using singular
value decomposition (SVD), finds an orthogonal basis for its
given data. The first eigenvector (and its weight, called the first
principal component) gives the direction (and from this, the scalp
topography) of maximum variance (between channels) in the data (and the
size of the data projection onto this direction). If the data is a linear
mixture of components (as ICA assumes), this direction will, in general,
sum contributions from as many of the independent components as possible.
Raw PCA components have orthogonal
activations and scalp distributions, while the goal of ICA is to find
temporally independent components which may have non-orthogonal
(even very similar) scalp distributions.
For example, ICA applied to group mean data from a color experiment
produced two separate series of components of responses to red and
blue checkerboards, respectively. These, naturally, had very similar
(but not identical) scalp maps.
Thus, PCA and ICA have very different goals, and naturally give
quite different results in nearly all cases.
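The orthogonality point can be checked directly. In this illustrative numpy sketch (made-up 3-channel data, not EEG), two independent sources are given deliberately similar mixing columns ("scalp maps"); PCA's component maps come out orthogonal regardless, so they cannot match the true, overlapping maps:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two independent sources with deliberately similar (non-orthogonal) maps
S = rng.laplace(size=(2, 4000))
A = np.array([[1.0, 0.9],
              [0.8, 1.0],
              [0.2, 0.3]])   # columns = the sources' "scalp maps"
X = A @ S
X = X - X.mean(axis=1, keepdims=True)

# PCA via SVD: the columns of Upca are the principal-component maps
Upca, _, _ = np.linalg.svd(X, full_matrices=False)

cos_true = A[:, 0] @ A[:, 1] / (np.linalg.norm(A[:, 0]) * np.linalg.norm(A[:, 1]))
cos_pca = Upca[:, 0] @ Upca[:, 1]
# cos_true is large (the true maps overlap strongly), while
# cos_pca is zero to machine precision (PCA maps are orthogonal by construction)
```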
PCA has usually been proposed for EEG analysis essentially as a
preprocessing stage prior to "rotation" of the spatial filters
using Varimax, Promax, Quartimax, or some other criterion,
much like "sphering" is often used to preprocess ICA input data
in the Bell & Sejnowski (1995) algorithm. The Varimax criterion
(maximize the variance of the basis vectors) is indirect at best, and
retains the undesirable feature that the scalp maps it finds must
be spatially orthogonal, whereas nearby but functionally differing
brain areas may produce highly correlated scalp patterns. (Promax is
a postprocessing stage applied on top of Varimax that relaxes Varimax's
orthogonality constraint on the maps slightly).
PCA-based EEG decomposition can be implemented in two ways -- so called
"temporal" and "spatial" PCA approaches. "Temporal" PCA,
such as that described by Chapman & McCreary (1994), forces PCA components
to have the same time courses of activation in each experimental condition.
(Möcks has introduced generalizations that somewhat relax this constraint.)
"Spatial" PCA and ICA avoid this shortcoming, allowing components to have
different time courses in each condition.
ICA takes into account higher-order dependencies (and independencies) in the
component activations, whereas PCA is based on the second-order covariance
matrix. Spatial PCA assumes orthogonality of the component maps, whereas
ICA maps are not constrained to be orthogonal ("The brain is not orthogonal").
In sum, ICA as implemented in
runica()
can be viewed as a new oblique
rotation method for post-processing of spatial PCA'd (or sphered, or raw) data.
Introducing ICA in this way may be useful when approaching potential users
(or reviewers) better versed in PCA decomposition.
7. What are the ICA algorithm's distributional assumptions
about the values of the component activations,
and how do they affect the output?
This is a complicated question. Bell & Sejnowski (1995) proved that
their algorithm will solve the independent components problem exactly
given several assumptions, among them that the c.d.f. (cumulative
distribution function) of each of the components matches the nonlinearity
used in the algorithm. We use the logistic function, but have no
proof that brain networks have a logistic c.d.f. In practice,
the algorithm performs well when the component distributions have
heavier tails (more extreme values) than the normal distribution (e.g., the
components are "on" infrequently; technically, their amplitude distributions
are super-Gaussian). Amari et al. address the question of why ICA
algorithms are 'super-efficient' in a recent paper (see Bell's web
page referenced above). Note the relation between the "sparsity" assumption
underlying logistic ICA and the "maximum compactness" assumption used in
Varimax and Promax. In
runica(),
another 'extended-ICA' method of identifying components
with sub-Gaussian component activations is available (see question 9b below).
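The distributional terms can be made concrete with a quick numpy check of excess kurtosis (zero for a Gaussian, positive for sparse super-Gaussian activations, negative for sub-Gaussian ones such as a pure sinusoid); the distributions below are illustrative stand-ins:

```python
import numpy as np

def excess_kurtosis(x):
    # Fourth moment about the mean, normalized; 0 for a Gaussian
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2 - 3.0

rng = np.random.default_rng(2)
n = 200000
k_laplace = excess_kurtosis(rng.laplace(size=n))                  # sparse: positive
k_gauss = excess_kurtosis(rng.normal(size=n))                     # near zero
k_sine = excess_kurtosis(np.sin(np.linspace(0, 400 * np.pi, n)))  # "always on": negative
```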
8. Isn't the EEG known to be Gaussian?
The algorithm will not separate a mixture of Gaussians. However,
central limit arguments suggest that the mixture of non-Gaussian
components may appear Gaussian, particularly if there are a large
number of them (e.g., brain networks with ~1/f temporal and spatial
distributions). In practice, infomax ICA can return good results
on data having small deviations from normality.
9. What are the differences between the
runica()
algorithm in the Makeig
et al. Matlab package and the original Bell & Sejnowski(1995) algorithm?
The main addition is the W'*W ("WTW") 'natural gradient' term,
which avoids matrix inversions during ICA training, greatly reducing
the computational load.
Tony Bell on incorporating
the natural gradient into infomax ICA ...
Note: Instead of sphering the data prior to training, one can simply
initialize the weight matrix to the sphering matrix. In
runica()
this is the default when 'sphering' is turned 'off'.
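The equivalence claimed in the Note can be checked numerically: with the natural-gradient update, sphering the data and starting from the identity matrix follows the same trajectory as feeding in raw data and starting from the sphering matrix. An illustrative numpy sketch, using a toy training loop and made-up data (not the toolbox code):

```python
import numpy as np

def infomax_steps(data, W, lrate=0.01, n_steps=200, block=50, seed=3):
    # Identical mini-batch order in both runs (same seed)
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = rng.integers(0, data.shape[1], block)
        u = W @ data[:, idx]
        y = 1.0 / (1.0 + np.exp(-u))
        W = W + lrate * (np.eye(W.shape[0]) + (1.0 - 2.0 * y) @ u.T / block) @ W
    return W

rng = np.random.default_rng(4)
X = np.array([[1.0, 0.6], [0.3, 1.0]]) @ rng.laplace(size=(2, 2000))
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / X.shape[1])
sphere = E @ np.diag(1.0 / np.sqrt(d)) @ E.T

Wa = infomax_steps(sphere @ X, np.eye(2))  # (a) sphere the data, start from I
Wb = infomax_steps(X, sphere)              # (b) raw data, start from the sphering matrix
# Wa @ sphere and Wb are (numerically) the same unmixing matrix
```

The design choice is transparent here: in run (b) the weight matrix always factors as (some matrix) times the sphering matrix, so the two runs track each other step for step.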
We also have reintroduced
online bias adjustments into
runica()
as originally given by Bell & Sejnowski (1995).
For skewed (non-symmetrical)
activation distributions
(very common in ERP analyses), bias adjustment gives better unmixing
results (though in simulations using 20 (mostly small) sources and
6 channels, no advantage for using the bias term was found).
runica()
also incorporates an
optional momentum term
into the weight update loop.
Though initial simulations did not find an advantage for using this term,
it might be of use in some cases.
9b. What is extended-ICA and when should it be used?
The logistic nonlinearity used in
runica()
is capable of
separating component activations that are super-Gaussian (kurtosis>0).
These are sparse, with infrequent periods of intense activation.
Sources whose activations are (for example)
more 'on' than 'off' may have sub-Gaussian
distributions of activation values. An algorithm for treating these
has recently been developed using infomax
(Te-Won Lee,
Mark Girolami & Terry Sejnowski (submitted)).
Te-Won explains
more about extended ICA ...
In
runica()
there are two ways of calling 'extended-ICA' using
the 'extended' keyword. If the following parameter is a positive integer (N),
a kurtosis estimate is run on the activation waveforms every N training
blocks. If the resulting vector of signs does not change for a number of
blocks, the rate of doing (computationally expensive) kurtosis estimates
is reduced.
The number of components separated using the sub-Gaussian nonlinearity
is given by the number of -1's in the signs vector returned by
runica()
and is indicated at each training step in verbose mode.
The number of actual sub-Gaussian component activations can also be found by
counting the number of negative-kurtosis component activations returned by
>> kurt(activations')
A faster but less flexible method for running extended-ICA is to use
a negative integer parameter (-N). In this case, the number of putative
sub-Gaussian components is fixed at N, and online kurtosis estimates are
not used. If the actual number of separable sub-Gaussian components is not
N, results may be difficult to interpret.
A prime impetus for using extended-ICA on psychophysiological data is
to remove line current noise from EEG recordings. The 60- or 50-Hz component
is sub-Gaussian, and can only be efficiently separated using extended-ICA.
(Jung et al., 1998).
Some eye movement components have also been found to be sub-Gaussian.
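The counting procedure above (the Matlab call kurt(activations')) can be sketched in numpy; here a sinusoidal stand-in for 60-Hz line noise is the single sub-Gaussian row among made-up, illustrative activations:

```python
import numpy as np

rng = np.random.default_rng(5)
n_pts = 100000
activations = np.vstack([
    rng.laplace(size=n_pts),                             # sparse: super-Gaussian
    np.sin(2 * np.pi * 60 * np.arange(n_pts) / 1000.0),  # 60-Hz sinusoid: sub-Gaussian
    rng.laplace(size=n_pts),
])

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2 - 3.0

kurtoses = np.array([excess_kurtosis(row) for row in activations])
n_subgaussian = int(np.sum(kurtoses < 0))   # number of putative sub-Gaussian rows
```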
10. Can the ICA algorithm be applied to magnetic (MEG) as well as electric (EEG) data?
Yes. The relation between brain fields and fields on the scalp is also
essentially linear, so long as the positions of the sensors and underlying
brain fields are fixed.
11. How many data channels and time points do I need to run the ICA algorithm?
The
runica()
version of the infomax ICA algorithm uses the natural gradient, so it
requires very few matrix inversions and runs very quickly on modern
workstations.
For large datasets (for example, single trials), the binary version
of runica() coded by Sigurd Enghoff,
available here,
is up to twelve times faster than the Matlab version and uses
only a fourth the memory!
In
our experience, the number of time points needed is at least several
times the number of scalp sensors (data channels). We have decomposed
data sets with 2 to 144 channels successfully. There is no upper limit
on the number of time points. However, see the question following
-- if you ask ICA to decompose lots of responses at once, your number
of data channels (and scalp sensors) should exceed the number of
expected components active in all the input conditions!
If the component activations returned by
runica()
are each large at
one time point only, then the weight matrix has not trained sufficiently,
and you need to retrain using more input data.
Remember that
ICA can only separate components that vary independently
in its input data.
Thus, two components with overlapping but equivalent
activations in each input data condition (e.g., each input ERP) will not
be separated cleanly. To separate them, the experimenter would need to add
conditions in which their amplitudes and/or time courses varied
independently. Normally, good experiments are exactly those in which the
phenomena of interest are made to vary independently, making ICA most
useful for analyzing data from scientifically sound experiments.
12. What does the ICA algorithm output?
The
runica()
function outputs two matrices (weights, sphere) which are
the heart of the matter. The "activations" of the ICA components (i.e.,
their time courses of activation) are the rows of the matrix
activations = weights*sphere*(input_data - row_means);
Note that these activations are not scaled to the input data units,
and that their polarities are also not meaningful. Why not? Because the
projection (reconstruction) of each single ICA component's contribution
to the original scalp data is produced by first zeroing out all but the
chosen component (row) in the activation matrix, then multiplying by
the inverse weight matrix:
projection = inv(weights*sphere) * single_activation;
The rows of this matrix are the time courses of the chosen component at the
respective scalp sensors, in the original input data units and with the
same polarity. The original channel means are not included -- to include them,
use
activations = weights*sphere*input_data; % including row means!
projection = inv(weights*sphere) * single_activation;
The "activations" are the time course of activation (strengths) for each
component. A component "projection" is the actual time course of potential
variations at each of the electrodes produced by the component.
The projections, as computed above, sum to the data
including its row means.
However, the offsets of the various component projections may not have
physiological significance.
Note:
there is no need to save the weight and sphere matrices separately.
If desired, weights can be set equal to weights*sphere, and eye(N) can
then be used by functions asking for a separate sphering matrix
(where N is the number of channels in the data).
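The relations in this answer can be verified numerically. This illustrative numpy sketch uses a random stand-in for a trained (weights, sphere) pair, rather than a real decomposition, and checks that the single-component projections sum back to the input data:

```python
import numpy as np

rng = np.random.default_rng(6)
n_chan, n_pts = 4, 500
X = rng.normal(size=(n_chan, n_pts))          # stand-in for the input data
weights = rng.normal(size=(n_chan, n_chan))   # stand-in for a trained weight matrix
sphere = np.eye(n_chan)                       # as in the Note above

unmix = weights @ sphere
activations = unmix @ X

# Back-project component k alone: zero the other rows, then multiply
# by the inverse of the unmixing matrix
mix = np.linalg.inv(unmix)
projections = []
for k in range(n_chan):
    single = np.zeros_like(activations)
    single[k] = activations[k]
    projections.append(mix @ single)

reconstruction = np.sum(projections, axis=0)  # all components together give back X
```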
13. Does the ICA algorithm always give the same answer when applied to the same data?
In repeated trainings, the Matlab random number generator
will cause
runica()
to deliver the data to
the training algorithm in different random orders. This has little
effect on the outcome; components with large projections are unchanged,
though small components may vary. This is an important reason why
components are ordered in decreasing order of
variance accounted for by their projections onto the scalp. In
runica(),
component polarity is set so that all the activation
waveforms are RMS positive, producing component maps that reflect the
actual polarities of their dominant projection onto the scalp (e.g.,
peak-positive scalp components in red, negative in blue in
compmap()).
However, training the algorithm with different amounts of data may produce
differences in the results. The ICA algorithm "pays attention" to the
relatively large activity in the data. Adding relatively large responses
to the training data and retraining may therefore cause the algorithm to
"pay less attention" to the original training data, resulting in fewer
components active chiefly in the now relatively smaller response conditions.
14. Can I use the ICA output (weight*sphere) matrix from an evoked response
decomposition to filter individual EEG traces?
Yes, but remember that the (weight*sphere) matrix is essentially a set
of spatial filters which in this case were not trained to filter out
the possibly numerous EEG components that were not present (or were
strongly attenuated) in the original averaged training data. For
example, if sufficient averaging succeeded in removing non-phase locked
alpha activity from an averaged ERP, all the filters resulting from an
ICA decomposition of this ERP might well allow the alpha activity in
the single-trial EEG to pass as well as (possibly many) other EEG
components.
15. If the number of brain "sources" is greater than the number of sensors,
what will the ICA algorithm do?
Our head-model simulations show that the presence of relatively large
numbers of small independent components degrades the accuracy of ICA
in finding large independent components "gracefully" (e.g., the
more or larger the "noise sources" in the simulation, the fewer large
sources are accurately extracted). Of course, if the number of large
independent components making up the simulated data is larger than the
number of sensors, then the ICA results may be poor. This constrains the
amount of input data to be given to the algorithm -- throwing in additional
(and unrelated) data (e.g., "the kitchen sink") may not leave enough
component degrees-of-freedom to separate out the phenomena of most interest.
However, the algorithm has recently been used to usefully decompose
as many as 75 31-channel ERPs from related conditions.
16. Should the number of outputs equal the number of inputs?
runica()
allows you to specify the number of output channels,
which may be smaller than the number of input channels. However, in our
experience this has not been useful, as the outputs are then forced to
be (typically noisy) mixtures of underlying sources. A better strategy,
in our experience, is to set the number of output channels to be equal
to the number of input channels (this is the default) and then to focus
on components of interest, in particular ignoring
those ICA components whose projections to the scalp are small, or which
represent phenomena of less experimental interest (e.g., eye movements,
line noise, etc.).
17. What kinds of data channels can be input?
So long as the assumed independent components will be mixed linearly in
the data, any kinds of data channels may be input together. For
example, EEG channels with common mastoid reference plus
bipolar EOG channels. Mixing MEG and EEG channels should be possible as well.
18. Should I make the input data mean-zero?
Yes, but
runica()
does this, and saves the channel means removed in the output
argument datamean. ERP and ERF data normally are baseline-zeroed
when they are calculated. Baseline-zeroing each
epoch in multi-epoch data may be accomplished with the toolbox function rmbase().
In most ERP data, the sources generating the given response are
assumed to be inactive during the baseline period. When the channel means
of the input data differ significantly from the baseline means, making
the data mean-zero prior to ICA training introduces an 'active'
DC (square-wave) component
without physiological significance
into the ICA decomposition.
This problem may be minimized by appending response epochs without baseline
offsets to the input data (for example: responses to standard stimuli appended
to responses to target stimuli containing large monophasic late waves).
(Note: a previous idea, symmetrically doubling the input data
around the baseline mean zero value, proved flawed, since when multiple
skewed source activations are present, this introduces a spurious correlation
between them). New ideas are welcome!
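Baseline-zeroing each epoch, as discussed in this answer, amounts to subtracting each epoch's per-channel baseline mean. An illustrative numpy sketch, with made-up dimensions and a hypothetical pre-stimulus window:

```python
import numpy as np

rng = np.random.default_rng(7)
n_chan, n_frames, n_epochs = 3, 100, 20
# Made-up epoched data with a constant offset standing in for a baseline shift
data = rng.normal(size=(n_chan, n_frames, n_epochs)) + 5.0
baseline_frames = 25   # hypothetical pre-stimulus window

# Subtract each epoch's per-channel baseline mean
baseline_mean = data[:, :baseline_frames, :].mean(axis=1, keepdims=True)
data_zeroed = data - baseline_mean
```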
19. Can I decompose data from different sessions collected with different sensor
montages?
This violates the assumption of spatial stationarity of the sources,
so it is possible only when the sensor placements are very nearly the same.
For example, grand averages from subject groups in different
experiments, all wearing the same electrode cap, can be decomposed
simultaneously (one such decomposition gave a single set of early visual
components for both experiments).
ICA decomposition is "montage-free" in the sense
that separate decompositions of the same
components using different sensor montages will find the same (strongest)
components.
20. Can I decompose data averaged across subjects?
Yes. Of course, any averaging procedure is valid only if some physical
uniformity assumptions are met.
21. Does the ICA algorithm work better with clean or noisy data?
Data with more large signal sources than data channels
(whether thought of as 'noise' or 'non-noise' sources)
cannot all be separated by the ICA algorithm,
although ICA may still identify the largest of them accurately.
In general,
ICA is not a tool designed for pulling small signals out of (true) noise,
like techniques that rely on detailed models of the signals to be extracted.
However, very small signals (e.g., a small muscle twitch) can be separated
out of much larger EEG activity when the total number of large mixed sources
is smaller than the number of sensors and the small signal's time course of
activation is quite unlike those of the other data sources.
22. Where can I learn more about ICA applied to evoked responses?
Papers on ICA applied to psychophysiological data are listed on
an ICA bibliography page.
23. Reference for the ICA toolbox:
Makeig, S. et al. "ICA Toolbox for Psychophysiological Research". WWW Site,
Computational Neurobiology Laboratory, The Salk Institute for Biological Studies,
www.cnl.salk.edu/~scott/ica.html [World Wide Web Publication], Dec. 13, 2000.
Reference for this FAQ:
Makeig, S. et al. "Frequently Asked Questions about ICA applied to EEG and MEG data".
WWW Site, Computational Neurobiology Laboratory, The Salk Institute for Biological
Studies, www.cnl.salk.edu/~scott/icafaq.html [World Wide Web Publication], Dec. 13, 2000.
Comments and suggestions on this tutorial are welcome. Email scott@salk.edu