[Eeglablist] Using ICA components for ML classification (Data leakage concerns)

Jason Palmer japalmer29 at gmail.com
Wed Jul 26 14:46:38 PDT 2023


Hi Zaeem,

The problem is not combining tasks horizontally, it's combining subjects vertically. You are associating arbitrary epochs from different subjects at different times.

8 channels is not sufficient for ICA. There are more than 8 sources so you can't separate them with 8 channels. There will also be more than 64 components in 8 subj x 8 channel data.

Maybe if you did extreme low-pass filtering to supposedly reduce the data to approximately 8 high-amplitude sources, and assumed the same spatial distribution for the 8 components across subjects, you could combine subjects horizontally as well, rather than vertically.
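
To make the horizontal/vertical distinction concrete, here is a minimal NumPy sketch. It assumes "horizontal" means concatenating along the time axis and "vertical" means stacking along the channel axis; the sample count and the random data are placeholders, not anything from the actual recordings.

```python
import numpy as np

# Hypothetical sizes: 8 subjects, 8 channels each, arbitrary sample count.
n_subjects, n_channels, n_samples = 8, 8, 1000

rng = np.random.default_rng(0)
# One (channels x samples) array per subject; random numbers stand in for EEG.
subjects = [rng.standard_normal((n_channels, n_samples)) for _ in range(n_subjects)]

# "Horizontal": append subjects along the time axis.
# The result still has 8 rows, so ICA can return at most 8 components.
horizontal = np.concatenate(subjects, axis=1)

# "Vertical": stack subjects along the channel axis.
# The result has 64 rows, and each column pairs unrelated time points
# taken from different subjects.
vertical = np.concatenate(subjects, axis=0)

print(horizontal.shape)  # (8, 8000)
print(vertical.shape)    # (64, 1000)
```

The horizontal layout only makes sense under the assumption stated above, that each row (channel) means the same thing in every subject.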

Best
Jason
________________________________
From: eeglablist <eeglablist-bounces at sccn.ucsd.edu> on behalf of Zaeem Hadi via eeglablist <eeglablist at sccn.ucsd.edu>
Sent: Wednesday, July 26, 2023 11:05:06 AM
To: eeglablist at sccn.ucsd.edu <eeglablist at sccn.ucsd.edu>
Subject: [Eeglablist] Using ICA components for ML classification (Data leakage concerns)

Hi,
I have EEG data of 8 individuals from 8 electrodes. Each individual performed a task in two conditions (within-subject design).
After preprocessing and epoching the data, I performed ICA on the combined data (the two task conditions together) for each individual. Thus a total of 64 independent components were estimated (8 channels x 8 subjects).

In the EEGLAB "Study", 40 components were included after excluding those with a residual variance >15%, and time-frequency measures were then computed for those 40 components.
After separating the 40 components into two conditions, I get 80 2-dimensional time-frequency matrices (40 components x 2 conditions).
I was wondering if I can consider these 80 components as independent samples for machine learning classification (to see if the time-frequency "activity" of the two task conditions can be distinguished). My concern is that since ICA was performed on combined data (both task conditions together) and then trials from the components were separated, it would be considered data leakage.
My first question is to confirm if that is a valid concern.
Since the above would not be a concern if the component activity is temporally independent across epochs, I was wondering if that is the case? (second question)
Third question: In a scenario where the above procedure could lead to data leakage, would it be valid to use the time-frequency measures of e.g. 32 components (from both conditions) for training, and then use 8 components (both condition pairs) for test? In other words, can I consider the 40 components as independent samples?
My interest is in distinguishing the activity from two tasks and losing subject-level information is not a concern.
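
The split described in the third question can be sketched as plain bookkeeping. This is only an illustration of the 32/8 component-level split, using the standard library; the condition labels "A" and "B", the seed, and the variable names are hypothetical. It does not by itself show that the components are independent samples.

```python
import random

# Hypothetical bookkeeping: 40 retained components, each contributing
# one time-frequency sample per condition.
n_components = 40
n_test = 8
conditions = ("A", "B")  # placeholder condition labels

component_ids = list(range(n_components))
random.Random(0).shuffle(component_ids)  # fixed seed for reproducibility

# Split at the component level, so both conditions of a component
# always land on the same side of the split.
test_ids = set(component_ids[:n_test])
train_ids = set(component_ids[n_test:])

# Each sample is a (component_id, condition) pair.
train = [(c, cond) for c in sorted(train_ids) for cond in conditions]
test = [(c, cond) for c in sorted(test_ids) for cond in conditions]

print(len(train), len(test))  # 64 16
```

Note that this split ignores subject identity: components from the same subject can still end up in both the training and the test set.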


Kind Regards,
Zaeem Hadi


