[Eeglablist] Using ICA components for ML classification (Data leakage concerns)

Thu Jul 27 06:03:05 PDT 2023

Zaeem and all -

For those looking for EEG data to work with, our NEMAR.org project now
serves over 200 public datasets (mostly EEG, also MEG and iEEG). When
needed, we are attempting to work with data authors to complete their
documentation and formal BIDS/HED annotation to make their further analysis
straightforward. Anyone looking for assistance in making their data public
may contact us.

Scott Makeig

On Wed, Jul 26, 2023 at 11:32 PM Zaeem Hadi via eeglablist <
eeglablist at sccn.ucsd.edu> wrote:

>
> Dear Arnaud and Jason,
> Thank you both for your responses. I just wanted to confirm again if I am
> understanding this correctly. I also acknowledge the limitations of the
> dataset but unfortunately, it is a slightly old dataset that I collected
> with a wireless EEG setup that we had available at the time.
> ICA was performed for each subject individually. After adding the
> resulting components to a "Study", I separated the components into the two
> within-subject conditions. I am extracting the pre-computed measures for
> these two conditions for the purpose of classification (the statistics have
> been performed).
> From Jason's response, it seems it is reasonable to have ERSP measures of
> one complete component (task1 + task2) as an independent sample for
> training the classifier.
> So, the following horizontal combination would be fine:
>   training sample 1:  IC1 Task 1 ERSP | IC1 Task 2 ERSP
>   training sample 2:  IC2 Task 1 ERSP | IC2 Task 2 ERSP
> but the following vertical combination is not?
>   training sample 1:  IC1 Task 1 ERSP
>   training sample 2:  IC1 Task 2 ERSP
>   training sample 3:  IC2 Task 1 ERSP
>   training sample 4:  IC2 Task 2 ERSP
>
> Kind Regards,
> Zaeem Hadi
>
>     On Thursday, July 27, 2023 at 03:13:35 AM GMT+1, Arnaud Delorme via
> eeglablist <eeglablist at sccn.ucsd.edu> wrote:
>
>  To rephrase Jason’s response, you need to run ICA on each subject
> individually. So for each subject, you will get 8 components.
> In EEGLAB, you can then create a STUDY with your 8 individuals and perform
> statistics.
>
> I agree that 8 components are not sufficient to process the data. The
> components might capture some artifacts, but it is hard to tell from their
> topography - you should focus on the components' time course and spectral
> profile. Even though you could technically use ICLabel to flag components,
> I would not. The minimum number of channels ICLabel was trained with was
> 19, I think. Running it on data with 8 channels is going to return nonsense.
>
> Best wishes,
>
> Arno
>
> > On Jul 26, 2023, at 11:46 AM, Jason Palmer via eeglablist <
> eeglablist at sccn.ucsd.edu> wrote:
> >
> > Hi Zaeem,
> >
> > The problem is not combining tasks horizontally, it's combining subjects
> vertically. You are associating arbitrary epochs from different subjects at
> different times.
> >
> > 8 channels is not sufficient for ICA. There are more than 8 sources so
> you can't separate them with 8 channels. There will also be more than 64
> components in 8 subj x 8 channel data.
> >
> > Maybe if you did extreme low-pass filtering to supposedly reduce the
> > data to approximately 8 high-amplitude sources, and assumed the same
> > spatial distribution for the 8 components across subjects, you could
> > combine subjects horizontally, not vertically, as well.
> >
> > Best
> > Jason
> > ________________________________
> > From: eeglablist <eeglablist-bounces at sccn.ucsd.edu> on behalf of Zaeem
> Hadi via eeglablist <eeglablist at sccn.ucsd.edu>
> > Sent: Wednesday, July 26, 2023 11:05:06 AM
> > To: eeglablist at sccn.ucsd.edu <eeglablist at sccn.ucsd.edu>
> > Subject: [Eeglablist] Using ICA components for ML classification (Data
> leakage concerns)
> >
> > Hi,
> > I have EEG data of 8 individuals from 8 electrodes. Each individual
> performed a task in two conditions (within-subject design).
> > After preprocessing and epoching the data, I performed ICA on the combined
> > data (the two task conditions together) for each individual. Thus a total
> > of 64 independent components were estimated (8 channels x 8 subjects).
> >
> > In the EEGLAB "Study", 40 components were retained after excluding those
> > with a residual variance >15%, and time-frequency measures were then
> > computed for those 40 components.
> > After separating the 40 components into two conditions, I get 80
> 2-dimensional time-frequency matrices (40 components x 2 conditions).
> > I was wondering if I can consider these 80 matrices as independent
> > samples for machine learning classification (to see if the time-frequency
> > "activity" of the two task conditions can be distinguished). My concern is
> > that since ICA was performed on the combined data (both task conditions
> > together) and the trials were only then separated by condition, this would
> > be considered data leakage.
> > My first question is to confirm if that is a valid concern.
> > Since the above would not be a concern if the component activity is
> temporally independent across epochs, I was wondering if that is the case?
> (second question)
> > Third question: In a scenario where the above procedure could lead to
> data leakage, would it be valid to use the time-frequency measures of e.g.
> 32 components (from both conditions) for training, and then use 8
> components (both condition pairs) for test? In other words, can I consider
> the 40 components as independent samples?
> > My interest is in distinguishing the activity from two tasks and losing
> subject-level information is not a concern.
> >
> >
> > Kind Regards,
> > Zaeem Hadi
> >
> >
> >
> > _______________________________________________
> > Eeglablist page: http://sccn.ucsd.edu/eeglab/eeglabmail.html
> > To unsubscribe, send an empty email to
> eeglablist-unsubscribe at sccn.ucsd.edu
> > For digest mode, send an email with the subject "set digest mime" to
> eeglablist-request at sccn.ucsd.edu

-- 
Scott Makeig, Research Scientist and Director, Swartz Center for
Computational Neuroscience, Institute for Neural Computation, University of
California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott
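
Editorial sketch (not part of the original messages): the leakage concern
discussed in the thread is often handled by splitting samples by subject, so
that no subject contributes to both the training and the test set. None of the
participants proposes this exact procedure; it is shown only as one possible
illustration. The helper name leave_one_subject_out and the bookkeeping below
are hypothetical, and the assumption of 5 retained components per subject is
invented for the example (the thread only states 40 retained components in
total across 8 subjects).

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject.

    All samples from the held-out subject go to the test set, so no subject
    appears on both sides of the split.
    """
    by_subject = defaultdict(list)
    for idx, subj in enumerate(subject_ids):
        by_subject[subj].append(idx)
    for held_out in sorted(by_subject):
        test_idx = list(by_subject[held_out])
        train_idx = [
            i
            for subj in sorted(by_subject)
            if subj != held_out
            for i in by_subject[subj]
        ]
        yield held_out, train_idx, test_idx


# Hypothetical bookkeeping: 8 subjects x 5 retained ICs x 2 conditions.
# Each tuple stands in for one ERSP matrix (the "vertical" arrangement).
samples = [
    (subj, ic, cond)
    for subj in range(1, 9)
    for ic in range(1, 6)
    for cond in ("task1", "task2")
]
subject_ids = [s[0] for s in samples]

folds = list(leave_one_subject_out(subject_ids))
for held_out, train_idx, test_idx in folds:
    train_subjects = {subject_ids[i] for i in train_idx}
    # The held-out subject must never appear in the training set.
    assert held_out not in train_subjects
    print(f"held-out subject {held_out}: "
          f"{len(train_idx)} train / {len(test_idx)} test samples")
```

With these made-up counts, every fold trains on 70 samples and tests on 10.
Splitting by subject also keeps both conditions of a given component on the
same side of the split, which is the pairing raised in the third question.
Whether a classifier trained this way is meaningful with only 8 channels is a
separate issue raised in the replies above.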