[Eeglablist] Using ICA components for ML classification (Data leakage concerns)

Fri Jul 28 17:23:11 PDT 2023

 Hi Jason,
I am trying to check the classification accuracy of a Conv-net for classifying ERSP activity belonging to two within-subject task conditions.
After ICA on an individual, a single component contains trials belonging to each task (that's what I was referring to: a component containing trials from both tasks for a single individual).
Taking a single component as an example, my question was whether each component as a whole (both tasks' trials combined) should go into the train or test data as a single training example, or whether I can separate the two types of trials (Task 1 and Task 2) from a single component and treat each as a separate training example (so the Task 1 trials from a component might randomly go into the train data while its Task 2 trials go into the test data).

Maybe I am misunderstanding the concept of temporal independence. But if a component is composed of temporally independent activity, if I separate two task trials from a single component and consider them as two independent samples, there shouldn't be a concern of data leakage (just thinking out loud).
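
For illustration only, here is a minimal sketch of a conservative alternative that does not depend on whether the temporal-independence argument holds: keep both condition samples of a component on the same side of the split. The sizes (40 components x 2 conditions) come from this thread; the function name `group_split`, the 20% test fraction, and the seed are assumptions made up for the example.

```python
import random

def group_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group appears in both train and test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, never individual samples.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Hypothetical layout: 40 components x 2 conditions = 80 samples.
component_id = [c for c in range(40) for _ in range(2)]  # group label per sample
condition = [t for _ in range(40) for t in (1, 2)]       # class label per sample

train_idx, test_idx = group_split(component_id, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

Passing a subject ID per sample instead of `component_id` would hold out whole subjects, which is the stricter version of the same idea.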
I am sorry for the confusion in my explanation and the long thread.
Thank you for your response.
Best wishes,
Zaeem

On Friday, July 28, 2023 at 09:43:25 PM GMT+1, Jason Palmer <japalmer29 at gmail.com> wrote:

 What kind of classification are you doing? Are you classifying trial types? I don't follow combining tasks and having two training sets from different components.
From: Zaeem Hadi <zaeemhadi at ymail.com>
Sent: Wednesday, July 26, 2023 11:13:39 PM
To: Arnaud Delorme <adelorme at ucsd.edu>
Cc: eeglablist at sccn.ucsd.edu <eeglablist at sccn.ucsd.edu>; japalmer29 at gmail.com <japalmer29 at gmail.com>
Subject: Re: [Eeglablist] Using ICA components for ML classification (Data leakage concerns)

Dear Arnaud and Jason,
Thank you both for your responses. I just wanted to confirm again if I am understanding this correctly. I also acknowledge the limitations of the dataset but unfortunately, it is a slightly old dataset that I collected with a wireless EEG setup that we had available at the time.
The ICA was performed for each subject individually, and after adding the resulting components into the "Study", I separated the components into the two within-subject conditions. I am extracting the pre-computed measures for these two conditions for the purpose of classification (statistics were performed).
From Jason's response, it seems it is reasonable to have ERSP measures of one complete component (task1 + task2) as an independent sample for training the classifier.
So, the following horizontal combination would be fine:
training sample 1:  IC1 Task 1 ERSP | IC1 Task 2 ERSP
training sample 2:  IC2 Task 1 ERSP | IC2 Task 2 ERSP
but the following vertical combination is not?
training sample 1:  IC1 Task 1 ERSP
training sample 2:  IC1 Task 2 ERSP
training sample 3:  IC2 Task 1 ERSP
training sample 4:  IC2 Task 2 ERSP
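
The two layouts above can be sketched in code as follows. This is only an illustration under assumed sizes: the ERSP matrices are tiny constant placeholders (3 frequencies x 4 time points), and the names `fake_ersp`, `horizontal`, and `vertical` are hypothetical, not anything produced by EEGLAB.

```python
# Hypothetical sizes: 2 ICs, ERSP of 3 frequencies x 4 time points per task.
N_FREQ, N_TIME = 3, 4

def fake_ersp(value):
    """Placeholder ERSP matrix (freq x time) filled with a constant."""
    return [[value] * N_TIME for _ in range(N_FREQ)]

# (ic, task) -> ERSP matrix; the constant encodes which IC/task it came from.
ersp = {
    (ic, task): fake_ersp(10 * ic + task)
    for ic in (1, 2) for task in (1, 2)
}

# Horizontal: one sample per IC, Task 1 and Task 2 side by side along time.
horizontal = [
    [row1 + row2 for row1, row2 in zip(ersp[(ic, 1)], ersp[(ic, 2)])]
    for ic in (1, 2)
]

# Vertical: one sample per (IC, task), each with its own task label.
vertical = [ersp[(ic, task)] for ic in (1, 2) for task in (1, 2)]
vertical_labels = [task for ic in (1, 2) for task in (1, 2)]
vertical_groups = [ic for ic in (1, 2) for task in (1, 2)]  # source IC per sample

print(len(horizontal), len(horizontal[0]), len(horizontal[0][0]))
print(len(vertical), vertical_labels, vertical_groups)
```

In the vertical layout, `vertical_groups` records which IC each sample came from, which is the information needed if samples from the same component are to be kept together in a split.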

Kind Regards,
Zaeem Hadi

On Thursday, July 27, 2023 at 03:13:35 AM GMT+1, Arnaud Delorme via eeglablist <eeglablist at sccn.ucsd.edu> wrote:

To rephrase Jason’s response, you need to run ICA on each subject individually. So for each subject, you will get 8 components.
In EEGLAB, you can then create a STUDY with your 8 individuals and perform statistics.

I agree that 8 components are not sufficient to process the data. The components might capture some artifacts, but it is hard to tell from their topography - you should focus on the components' time course and spectral profile. Even though you could technically use ICLabel to flag components, I would not. The minimum number of channels ICLabel was trained with was 19, I think. Running it on data with 8 channels is going to return nonsense.

Best wishes,

Arno

> On Jul 26, 2023, at 11:46 AM, Jason Palmer via eeglablist <eeglablist at sccn.ucsd.edu> wrote:
> 
> Hi Zaeem,
> 
> The problem is not combining tasks horizontally, it's combining subjects vertically. You are associating arbitrary epochs from different subjects at different times.
> 
> 8 channels is not sufficient for ICA. There are more than 8 sources so you can't separate them with 8 channels. There will also be more than 64 components in 8 subj x 8 channel data.
> 
> Maybe if you did extreme low-pass filtering to supposedly reduce the data to approximately 8 high-amplitude sources, and assumed the same spatial distribution for the 8 components across subjects, you could combine subjects horizontally, not vertically, as well.
> 
> Best
> Jason
> ________________________________
> From: eeglablist <eeglablist-bounces at sccn.ucsd.edu> on behalf of Zaeem Hadi via eeglablist <eeglablist at sccn.ucsd.edu>
> Sent: Wednesday, July 26, 2023 11:05:06 AM
> To: eeglablist at sccn.ucsd.edu <eeglablist at sccn.ucsd.edu>
> Subject: [Eeglablist] Using ICA components for ML classification (Data leakage concerns)
> 
> Hi,
> I have EEG data of 8 individuals from 8 electrodes. Each individual performed a task in two conditions (within-subject design).
> After preprocessing and epoching the data, I performed ICA on the combined data (the two task conditions combined) for each individual. Thus a total of 64 independent components were estimated (8 channels x 8 subjects).
> 
> In the EEGLAB "Study", 40 components were included after excluding those with a residual variance >15%, and time-frequency measures were then computed for those 40 components.
> After separating the 40 components into two conditions, I get 80 2-dimensional time-frequency matrices (40 components x 2 conditions).
> I was wondering if I can consider these 80 matrices as independent samples for machine learning classification (to see if the time-frequency "activity" of the two task conditions can be distinguished). My concern is that since ICA was performed on combined data (both task conditions together) and the trials from the components were then separated, it would be considered data leakage.
> My first question is to confirm if that is a valid concern.
> Since the above would not be a concern if the component activity is temporally independent across epochs, I was wondering if that is the case? (second question)
> Third question: In a scenario where the above procedure could lead to data leakage, would it be valid to use the time-frequency measures of e.g. 32 components (from both conditions) for training, and then use 8 components (both condition pairs) for test? In other words, can I consider the 40 components as independent samples?
> My interest is in distinguishing the activity from two tasks and losing subject-level information is not a concern.
> 
> 
> Kind Regards,
> Zaeem Hadi
> 
> 
> 
> _______________________________________________
> Eeglablist page: http://sccn.ucsd.edu/eeglab/eeglabmail.html
> To unsubscribe, send an empty email to eeglablist-unsubscribe at sccn.ucsd.edu
> For digest mode, send an email with the subject "set digest mime" to eeglablist-request at sccn.ucsd.edu