[Eeglablist] Using ICA components for ML classification (Data leakage concerns)

Zaeem Hadi zaeemhadi at ymail.com
Wed Jul 26 20:13:35 PDT 2023


 
Dear Arnaud and Jason,
Thank you both for your responses. I just wanted to confirm that I am understanding this correctly. I acknowledge the limitations of the dataset; unfortunately, it is a slightly old dataset that I collected with the wireless EEG setup we had available at the time.
The ICA was performed for each subject individually. After adding the resulting components to a "Study", I separated the components into the two within-subject conditions. I am extracting the pre-computed measures for these two conditions for the purpose of classification (statistics were performed).
From Jason's response, it seems it is reasonable to have ERSP measures of one complete component (task1 + task2) as an independent sample for training the classifier.
So, the following horizontal combination would be fine:
  training sample 1:  IC1 Task 1 ERSP | IC1 Task 2 ERSP
  training sample 2:  IC2 Task 1 ERSP | IC2 Task 2 ERSP
but the following vertical combination is not?
  training sample 1:  IC1 Task 1 ERSP
  training sample 2:  IC1 Task 2 ERSP
  training sample 3:  IC2 Task 1 ERSP
  training sample 4:  IC2 Task 2 ERSP
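
To make the two layouts above concrete, here is a minimal Python sketch (plain Python, not EEGLAB code). It only builds index records for the 40 retained components; no real ERSP data is involved. The assignment of components to subjects (`ic % 8`), the helper names, and the 20% hold-out fraction are all assumptions made for illustration. The `grouped_split` helper keeps every row that shares a group (a component, or a subject) on the same side of the train/test split, which is one possible way to keep the two conditions of one IC from landing on both sides.

```python
import random

N_COMPONENTS = 40  # components retained in the STUDY (from the thread)
N_SUBJECTS = 8     # subjects (from the thread)


def make_vertical_samples(n_components=N_COMPONENTS, n_subjects=N_SUBJECTS):
    """Vertical layout: one row per (component, task), i.e. 2 rows per IC."""
    rows = []
    for ic in range(n_components):
        subject = ic % n_subjects  # ASSUMPTION: placeholder IC-to-subject mapping
        for task in (1, 2):
            rows.append({"ic": ic, "subject": subject, "task": task})
    return rows


def make_horizontal_samples(n_components=N_COMPONENTS, n_subjects=N_SUBJECTS):
    """Horizontal layout: one row per component, both task ERSPs side by side."""
    return [
        {"ic": ic, "subject": ic % n_subjects, "ersp": ("task1", "task2")}
        for ic in range(n_components)
    ]


def grouped_split(rows, group_key, test_fraction=0.2, seed=0):
    """Hold out whole groups so that no group appears in both train and test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


vertical = make_vertical_samples()      # 80 rows (40 ICs x 2 tasks)
horizontal = make_horizontal_samples()  # 40 rows (one per IC)

# Hold out 8 of the 40 components; both task rows of an IC stay together.
train, test = grouped_split(vertical, group_key="ic")
print(len(vertical), len(horizontal), len(train), len(test))
```

Passing `group_key="subject"` instead holds out whole subjects, which would be the stricter choice if the concern is mixing data across subjects.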

Kind Regards,
Zaeem Hadi

    On Thursday, July 27, 2023 at 03:13:35 AM GMT+1, Arnaud Delorme via eeglablist <eeglablist at sccn.ucsd.edu> wrote:  
 
 To rephrase Jason’s response, you need to run ICA on each subject individually. So for each subject, you will get 8 components.
In EEGLAB, you can then create a STUDY with your 8 individuals and perform statistics.

I agree that 8 components are not sufficient to process the data. The components might capture some artifacts, but it is hard to tell from their topography - you should focus on the components' time course and spectral profile. Even though you could technically use ICLabel to flag components, I would not. The minimum number of channels ICLabel was trained with was 19, I think. Running it on data with 8 channels is going to return nonsense.

Best wishes,

Arno

> On Jul 26, 2023, at 11:46 AM, Jason Palmer via eeglablist <eeglablist at sccn.ucsd.edu> wrote:
> 
> Hi Zaeem,
> 
> The problem is not combining tasks horizontally, it's combining subjects vertically. You are associating arbitrary epochs from different subjects at different times.
> 
> 8 channels is not sufficient for ICA. There are more than 8 sources so you can't separate them with 8 channels. There will also be more than 64 components in 8 subj x 8 channel data.
> 
> Maybe if you did extreme low-pass filtering to supposedly reduce the data to approximately 8 high-amplitude sources, and assumed the same spatial distribution for the 8 components across subjects, you could combine subjects horizontally, not vertically, as well.
> 
> Best
> Jason
> ________________________________
> From: eeglablist <eeglablist-bounces at sccn.ucsd.edu> on behalf of Zaeem Hadi via eeglablist <eeglablist at sccn.ucsd.edu>
> Sent: Wednesday, July 26, 2023 11:05:06 AM
> To: eeglablist at sccn.ucsd.edu <eeglablist at sccn.ucsd.edu>
> Subject: [Eeglablist] Using ICA components for ML classification (Data leakage concerns)
> 
> Hi,
> I have EEG data of 8 individuals from 8 electrodes. Each individual performed a task in two conditions (within-subject design).
> After preprocessing and epoching the data, I performed ICA on the combined data (the two task conditions combined) for each individual. Thus, a total of 64 independent components were estimated (8 channels x 8 subjects).
> 
> In the EEGLAB "Study", 40 components were included after excluding those with a residual variance >15%, and time-frequency measures were then computed for those 40 components.
> After separating the 40 components into two conditions, I get 80 2-dimensional time-frequency matrices (40 components x 2 conditions).
> I was wondering if I can consider these 80 components as independent samples for machine learning classification (to see if the time-frequency "activity" of the two task conditions can be distinguished). My concern is that since ICA was performed on combined data (both task conditions together) and then trials from the components were separated, it would be considered data leakage.
> My first question is to confirm if that is a valid concern.
> Since the above would not be a concern if the component activity is temporally independent across epochs, I was wondering if that is the case? (second question)
> Third question: In a scenario where the above procedure could lead to data leakage, would it be valid to use the time-frequency measures of e.g. 32 components (from both conditions) for training, and then use 8 components (both condition pairs) for test? In other words, can I consider the 40 components as independent samples?
> My interest is in distinguishing the activity from two tasks and losing subject-level information is not a concern.
> 
> 
> Kind Regards,
> Zaeem Hadi
> 
> 
> 
> _______________________________________________
> Eeglablist page: http://sccn.ucsd.edu/eeglab/eeglabmail.html
> To unsubscribe, send an empty email to eeglablist-unsubscribe at sccn.ucsd.edu
> For digest mode, send an email with the subject "set digest mime" to eeglablist-request at sccn.ucsd.edu


