[Eeglablist] Using ICA components for ML classification (Data leakage concerns)

Wed Jul 26 08:05:06 PDT 2023

Hi,
I have EEG data from 8 individuals, recorded with 8 electrodes each. Each individual performed a task in two conditions (within-subject design).
After preprocessing and epoching the data, I performed ICA on the combined data (the two task conditions together) separately for each individual. Thus a total of 64 independent components were estimated (8 channels x 8 subjects).

In the EEGLAB "STUDY", 40 components were retained after excluding those with a residual variance >15%, and time-frequency measures were then computed for those 40 components.
After separating the 40 components into two conditions, I get 80 2-dimensional time-frequency matrices (40 components x 2 conditions). 
I was wondering if I can treat these 80 matrices as independent samples for machine learning classification (to see if the time-frequency "activity" of the two task conditions can be distinguished). My concern is that, since ICA was performed on the combined data (both task conditions together) and the trials of each component were only separated afterwards, this would constitute data leakage.
My first question is whether that is a valid concern.
Second question: the above would not be a concern if the component activity were temporally independent across epochs. Is that the case?
Third question: In a scenario where the above procedure does lead to data leakage, would it be valid to use the time-frequency measures of e.g. 32 components (both conditions of each) for training, and the remaining 8 components (again both conditions of each) for testing? In other words, can I consider the 40 components as independent samples?
My interest is in distinguishing the activity from two tasks and losing subject-level information is not a concern.
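
To make the split in my third question concrete, here is a rough sketch of what I have in mind. It is plain Python (standard library only) with placeholder component IDs and condition labels; the function names are hypothetical and not part of EEGLAB, and no real time-frequency data is involved.

```python
import random

# Hypothetical stand-ins: 40 retained components, each contributing one
# time-frequency matrix per condition. Real features would come from the
# EEGLAB STUDY; here only the bookkeeping of the split is shown.
N_COMPONENTS = 40
CONDITIONS = ("A", "B")          # placeholder condition labels
N_TEST_COMPONENTS = 8


def split_by_component(n_components, n_test, seed=0):
    """Return (train_ids, test_ids) so that no component is in both sets."""
    ids = list(range(n_components))
    random.Random(seed).shuffle(ids)
    test_ids = sorted(ids[:n_test])
    train_ids = sorted(ids[n_test:])
    return train_ids, test_ids


def expand_to_samples(component_ids, conditions=CONDITIONS):
    """One (component, condition) sample per condition of each component."""
    return [(comp, cond) for comp in component_ids for cond in conditions]


train_ids, test_ids = split_by_component(N_COMPONENTS, N_TEST_COMPONENTS)
train_samples = expand_to_samples(train_ids)
test_samples = expand_to_samples(test_ids)

print(len(train_samples), len(test_samples))  # 64 16
```

This only guarantees that both condition matrices of a given component end up on the same side of the split. Whether components from the same subject (i.e. from the same ICA decomposition) can be treated as independent of each other is exactly what I am unsure about.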

Kind Regards,
Zaeem Hadi