The questions below have recently come by email from researchers who are considering applying ICA to their ERP averages. They have given me permission to post this dialogue, since others may have similar questions. -Scott Makeig

How Much Data is Enough?


> Others here and I have been exploring the use of ICA for
> some of our research.  [Is it true that] one must have at least
> the square of the number of sensors to meet the necessary assumptions
> of ICA?  Because this analysis allows the separation of overlapping
> components of the recorded signal, it is ideally suited for addressing
> some of the questions we're looking at, but we would like to ensure
> that we're "doing it right".
> 
> Since we have 300 points per channel, and 129 channels, we don't 
> have a sufficient number of samples to extract 129 components.  
> However, in the MATLAB program, there is the ability to limit the 
> number of components via an initial PCA procedure.  We have 
> been using this to limit our number of components, usually to 
> somewhere between 6 and 20, though it appears 17 should be our 
> maximum.  
> 
> My question is, if one has (as we do) 300 data points per electrode 
> and 129 electrodes, does limiting the number of components to a 
> maximum of 17 via the PCA satisfy the criterion for number of 
> samples and allow us to perform a valid ICA?  Or, is it more 
> advisable to limit the number of electrodes by taking a subset over 
> the head as well?

There is no fixed lower limit on the number of points needed for a "good" ICA solution - and in fact no fixed way to judge whether an ICA solution is "good" or not. However, given 129 channels and, say, 129 points, infomax will usually find the obvious degenerate solution: one component fitting each time point. This solution will give you *no* additional information or insight into your data, and will therefore be useless - or possibly worse, if you try to over-interpret it!

In your case there are 129^2 unmixing matrix weights for ICA to learn. My best guess is that the number of data points required may be some multiple of 129^2 (this could differ in some cases, depending on the component mixture -- also, not all data may be modeled as the sum of independent components!). The PCA option is provided as a principled though imperfect way to make the training tractable for large numbers of channels.
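
As a rough sanity check -- a sketch only, since the multiple-of-129^2 figure above is a rule of thumb and not a hard threshold -- you can compare the number of data points you have to the number of weights to be learned:

    % Rule-of-thumb check: data points available vs. unmixing weights to learn
    nchans   = 129;                % number of channels
    npoints  = 300;                % time points in the average
    nweights = nchans^2;           % entries in the square unmixing matrix
    fprintf('points per weight: %.3f\n', npoints / nweights);
    % Here 300/16641 is about 0.018 -- far less than one point per weight,
    % so a full 129-channel decomposition of a single 300-point average
    % is hopelessly underdetermined.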

However, I think you should rethink the appropriateness of your experimental design for ICA. Infomax looks for independence -- that is, for things that happen independently of each other: roughly, things that happen at different times, or in different ways at different times.

Your 300-point EEG data, with their low-pass spectral character, actually contain only perhaps 5-10 independent time windows. ICA could extract this many independent components, however, only if each component were active in just one independent time window. If this were the case, you might do better simply to measure their separate manifestations directly (by their peaks or other features).

Note that in our 1999 J Neurosci and J Royal Society papers on visual P3 and N1 decompositions (Makeig et al., 1999ab), we decomposed twenty-five 256-point 31-channel ERP condition grand averages simultaneously. Across this array of conditions, several EEG phenomena were expressed at different times/conditions - allowing ICA to separate them. Also, the number of input data points (25*256 = 6400) was over six times the number of weights to be learned (31^2 = 961).

Are there other condition averages you could concatenate into your training data, such as non-target responses in each of N stimulus conditions, and so on?

Note: We are now intensively exploring another most interesting, exciting and challenging approach: decomposing the whole collection of *single trials* from a subject. We have begun to publish results of this method, which requires a somewhat different framework of assumptions and interpretation than the usual framework for thinking about averaged ERPs. See our tutorial pages on single-trial analysis, as well as our latest abstracts.


> Just to be sure I follow you here: So this means that even with the 
> PCA limiting the number of output components to, say, 6, if we 
> include all 129 channels of data, we still should have about 129^2 
> data points. 

No, after selecting the largest 6 principal components, the ICA weight matrix has only 6^2 = 36 weights, so a few hundred points would probably be sufficient -- if the data fit the ICA model. However, there may not be 6 temporally independent components in the few hundred points (as I argue above). Also, PCA is a blunt tool for compressing data, potentially allowing phenomena of interest to be removed from the data or further mixed.
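
For reference, a minimal sketch of the PCA option mentioned above, assuming the toolbox's runica() function and its 'pca' keyword (the data matrix here is a stand-in):

    % Reduce the data to its 6 largest principal components before ICA,
    % so that only a 6 x 6 unmixing matrix (36 weights) is learned.
    data = randn(129, 300);                     % stand-in channels x frames average
    [weights, sphere] = runica(data, 'pca', 6);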

Another researcher wrote ...


> The data was composed of 26 channels, 230 epochs of 500 samples.
> Therefore I passed a matrix of 26 x 115000 to the function.

You are training 26^2 = 676 weights with 115,000 points - about 170 points per weight. In our experience, this is quite sufficient to return components with compact source maps and distinctive dynamics.

Should I Concatenate Data Conditions?


> Would it be more appropriate, therefore, to concatenate our 
> conditions to get sufficient numbers of data points, and then 
> perform an ICA?

A: Yes, this is what I'm suggesting.
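
For example, here is a minimal sketch of condition concatenation, assuming erpA, erpB, and erpC (hypothetical names) are same-size channels x frames condition averages:

    % Concatenate condition-average ERPs along the time dimension and
    % decompose them together, as in the grand-average analyses above.
    erpA = randn(31, 256); erpB = randn(31, 256); erpC = randn(31, 256); % stand-ins
    data = [erpA, erpB, erpC];          % 31 x (3*256) training matrix; in practice,
                                        % concatenate as many conditions as available
    [weights, sphere] = runica(data);   % one decomposition for all conditions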

Should I worry about discontinuities?


> I have not attempted to match the ends of the epoch to each other.
> I just assumed that the discontinuities will be mapped to the largest 
> component.

But ICA does not see discontinuities - in fact, it shuffles all the time points randomly before each training step! This is quite unlike FFTs, for example, which operate on ordered time series and do not 'see' between-channel relationships. ICA (operating on EEG data) sees only an unordered pile of maps!
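
To see why, note that shuffling the time points of the input leaves the ICA problem unchanged; a small sketch (with a stand-in data matrix):

    % ICA treats the data as an unordered collection of scalp maps, so
    % randomly permuting the time points changes nothing essential.
    data = randn(26, 5000);             % stand-in channels x frames matrix
    perm = randperm(size(data, 2));     % random time-point ordering
    shuffled = data(:, perm);
    % runica(shuffled) poses exactly the same unmixing problem as
    % runica(data), up to component order/sign and training noise.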


> Would it be possible to "pad" a data set and concatenate this "mock" data, 
> similar to padding to a power of two for an FFT?

A: No. Zero-padding here makes no sense, nor does noise padding. Only adding extra conditions in which (mainly) the same set of sources act (somewhat) differently with respect to each other will help.

If I use real input, why do I get complex output?


>  I am using the toolbox for studying late-potentials (>300 ms) in mismatch
>  negativity ERP data.  I am trying to remove the components that are not 
>  significant in the frontal, parietal and occipital areas.
>  The data is 26-channel, 21 EEG + EKG + Eyes + EMG
>  
>  My problem is that "runica" sometimes produces complex (in the math sense)
>  results and takes 11 hours to compute.
>  The norm of the imaginary components is 1/200 that of the real components.
>  Before running runica() I have:
>  (1) removed the baseline from each epoch using the prestimulus data
>  (2) discarded epochs with amplitudes of over 50 uV in the CZ channel
>  (3) converted the data to average reference using averef() 
>  (4) removed the baseline again using rmbase() 
>  
>  Am I omitting some crucial preprocessing ?

Failure to converge and complex results usually mean your data are not of full rank. Run >> rank(data) to test this. Applying averef() reduces the data dimension by 1, so running runica() (or better, binica()) with the 'pca' option set appropriately (e.g., to nchans-1 or lower) is necessary.
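
A minimal sketch of this check, assuming the toolbox's averef() and runica() functions (the data matrix here is a stand-in):

    % Average-referencing removes one dimension from the data, so test
    % the rank and reduce the ICA dimension to match before training.
    data = averef(randn(26, 5000));     % stand-in; rank drops from 26 to 25
    r = rank(data);                     % returns 25 here, not 26
    [weights, sphere] = runica(data, 'pca', r);   % or binica() with the same option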

In general, I do not believe using averef() is a good idea (unless you have really high-density, whole-head recordings). If you want, you can use averef() at the end of the analysis instead of the beginning.

Are four channels enough?


> We are currently trying to analyse ERP data collected from an
> experiment involving a rather complicated paradigm which includes mismatch,
> emotional / non-emotional stimuli, as well as infrequent stimuli
> (oddball-paradigm-like). However, we have only 4 channels. As you have
> advised on your web page, we should have more channels than the expected
> number of components. 
> 
> From the results of your experiments (J Neurosci 19:2665-2680 (1999)), 
> you found 4 psychologically meaningful LP components.  Although we have 
> a different paradigm (hence possibly different number of expected components), 
> would it still be advisable to perform ICA on the data to obtain meaningful 
> results?
> 
> My supervisor is quite keen to perform ICA on the data. However I am personally 
> against it, as I expect more components (including 'noise' components) than 
> we have channels. Please advise.

I agree with you that ICA cannot be expected to give optimum results applied to a collection of ERPs with only four channels. However, as an exploratory measure, ICA is easy to perform and visualize (see the new toolbox tutorial). The question is how much belief to put in the functional independence of the resulting components. This should not in any case be blind faith, but faith won through finding convergent behavioral and other physiological evidence.

If you are serious about squeezing maximum information out of 4-channel data, I would advise using short-time moving-window ICA applied to the raw data (a rough sketch is given below). This will give a huge collection of components, which must then be clustered, etc. (not easy). You could also investigate blind deconvolution applied to the single trials, though this is full of circular confounds, etc... It might be easier to collect the data again somewhere at higher channel density. Good luck!
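
Here is a rough sketch of the moving-window idea, with hypothetical window parameters (runica() options as in the toolbox; the raw data matrix is a stand-in):

    % Short-time moving-window ICA: decompose successive raw-data windows
    % and collect the resulting component maps for later clustering.
    data = randn(4, 20000);                      % stand-in 4-channel raw data
    winlen = 1000; step = 500;                   % hypothetical window length / hop
    maps = [];
    for start = 1:step:(size(data,2) - winlen + 1)
        seg = data(:, start:start+winlen-1);
        [w, s] = runica(seg, 'verbose', 'off');
        maps = [maps, inv(w * s)];               % columns are this window's scalp maps
    end
    % 'maps' now holds 4 component maps per window; clustering them is the
    % (not easy) next step mentioned above.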

What is temporal ICA?


> As you are aware, there are two implementations of PCA -- temporal
> and spatial. From reading your webpage, I get the impression that spatial
> PCA and ICA are a superior form of temporal PCA. But from some of my readings
> I got the idea that spatial PCA (and possibly ICA) and temporal PCA answer
> different questions altogether. Can you kindly clarify this? Also, is there
> a 'temporal PCA'-equivalent form of ICA? Thanks for your time.

These terms are confusing. In what you call "spatial ICA," the algorithm looks at a collection of maps (one per time point) and finds temporally independent components, each with a fixed spatial basis map. In what you call "temporal ICA," it looks at a collection of time courses (one per location) and finds spatially independent basis maps, each with a fixed time course!

What you call temporal ICA is used for fMRI analysis (see McKeown et al., Human Brain Mapping, 1998), where it makes sense. For EEG it does not, since essentially no EEG channel is independent of any other.
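
In terms of the matrix shapes involved, a toy sketch of the distinction (stand-in dimensions only, not a working analysis):

    % Two orientations of the same algorithm, distinguished by which
    % dimension of the data is treated as the collection of observations:
    X_eeg  = randn(31, 6400);   % EEG:  channels x time points
    % ICA on X_eeg learns a 31 x 31 unmixing matrix W; the rows of
    % W * X_eeg are maximally INDEPENDENT TIME COURSES, and the columns
    % of inv(W) are their fixed scalp maps.
    X_fmri = randn(200, 5000);  % fMRI-style: time points x voxels (toy sizes)
    % ICA on X_fmri learns a 200 x 200 unmixing matrix; the rows of
    % W * X_fmri are maximally INDEPENDENT SPATIAL MAPS, and the columns
    % of inv(W) are their fixed time courses.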

As for spatial/temporal PCA, I do believe the point I made in the JNS '99 paper. Since Promax (not PCA) tries to minimize the support of each component, spatial Promax will attempt to find superficial sources (projecting strongly to only a few electrodes). Therefore spatial Promax sources have features like 'nipples.' As Chapman and McCrary pointed out, temporal Promax makes more sense, since in this case the components have minimal temporal support - i.e., they are 'on' for a minimal period of time.

Be wary of claims that researchers are using "PCA." Note what rotation method they are in fact using. Varimax, an intermediate step in the Promax routine, does not require PCA pre-processing (Moecks), so PCA itself plays little or no essential role in the results of Varimax or Promax!

PCA itself tries to gather as much activity as possible into the first component (and then into the second, etc.), constrained by the quite unreasonable assumption that the scalp maps are orthogonal. PCA may be used for dimension reduction (e.g., prior to ICA), though by removing the many smallest principal components one runs the risk of removing small details of interest.
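
As a sketch of that tradeoff (stand-in data; svd() as in standard MATLAB), one can check how much variance the discarded principal components carry before reducing:

    % PCA reduction prior to ICA, with a check on the variance discarded
    data = randn(31, 6400);            % stand-in channels x frames matrix
    [U, S, V] = svd(data, 'econ');     % principal axes of the channel data
    k = 10;                            % number of components to keep
    sv = diag(S);
    fprintf('variance retained by %d PCs: %.1f%%\n', ...
            k, 100 * sum(sv(1:k).^2) / sum(sv.^2));
    reduced = U(:, 1:k)' * data;       % k x frames matrix passed on to ICA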

The same cautions about component interpretation apply to Promax (or other -max) results as apply to (Infomax) ICA. Again, faith in the meaning of components found by any linear decomposition should not in any case be blind faith, but faith won through discovering and confirming the reliability of convergent behavioral and other physiological evidence.

- Scott

 
