[Eeglablist] Running AMICA in a super-computer

Tyler Grummett tyler.grummett at flinders.edu.au
Wed Jul 26 00:07:17 PDT 2017


Dear Jason,


Thank you for taking the time to reply to my email.


I'm running the jobs on my university's supercomputer. To my knowledge we don't do cluster computing; each computer has a high number of processors. So each job asks for a certain number of threads (or slots) and is run on a single computer with the number of processors requested.


The first few lines in the AMICA output are:

           1 processor name = c106.csem.flinders.edu.au
           1 host_num =    190898383
           This is MPI process           1 of           1 ; I am process           1 of
           1 on node: c106.csem.flinders.edu.au
           1  : node root process           1 of           1


The qsub command that I run is:

qsub -t 1:93:1 -m e -M grum0003 at flinders.edu.au -S /bin/bash -cwd -q all.q -l matlab=1 -pe threaded 16 Scripts/AMICA_script.txt 'Control' 'phd' 'false' 'false'

The MATLAB code that is run is:
[ EEG_merged.icaweights, EEG_merged.icasphere, ~] = runamica15( EEG_merged.datfile, ...
        'pcakeep', PCA_comps, 'num_chans', EEG_merged.nbchan, 'max_threads', Nthreads, ...
        'max_iter', max_iter, 'outdir', amicaout_folderpath);

The 'amica_win' folder is copied to the subject's directory, added to the path, and then deleted after the function is finished to ensure that no other process is using the same function. Currently I am running 45 subjects on our supercomputer using 16 threads each, and only one of them has crashed. When I was using 8 threads on each job, I was able to run 90 subjects at once, which resulted in a larger number of crashes. So it does appear as though the more subjects (jobs) I run, the more it crashes, even though each job is using its own runamica15 and outputting to a folder which is unique to it.
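
One way to make the per-job isolation described above explicit is to have each array task create its own scratch directory with mktemp. This is only a sketch, assuming a Grid Engine style scheduler (the -pe and -q all.q options suggest one, but that is an assumption) that sets JOB_ID and SGE_TASK_ID for each task. SCRATCH_ROOT and make_task_dir are hypothetical names used for illustration and are not part of AMICA or EEGLAB.

```shell
#!/bin/bash
# Sketch: give each array task a private working directory.
# ASSUMPTIONS: JOB_ID and SGE_TASK_ID are set by the scheduler for array tasks;
# SCRATCH_ROOT and make_task_dir are hypothetical names for this example.

make_task_dir() {
    # Fall back to TMPDIR, then /tmp, if no scratch root was configured.
    local root="${SCRATCH_ROOT:-${TMPDIR:-/tmp}}"
    local tag="amica_${JOB_ID:-nojob}_${SGE_TASK_ID:-0}"
    # mktemp -d creates the directory atomically and adds a random suffix,
    # so two tasks cannot end up with the same path even if their IDs collide.
    mktemp -d "${root}/${tag}.XXXXXX"
}

# Possible use inside the job script:
#   task_dir=$(make_task_dir) || exit 1
#   trap 'rm -rf "$task_dir"' EXIT   # remove the directory when the job ends
```

The directory name printed by make_task_dir could then be passed to MATLAB as the location for the copied binary and for 'outdir'. Whether this changes the crash rate is untested; it only rules out path collisions between tasks.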

Hopefully I have provided enough information. Let me know if I haven't.

Regards,
Tyler


*************************

Tyler Grummett ( BBSc, BSc(Hons I))
PhD Candidate
Brain Signals Laboratory
Flinders University
Rm 5A301
Ext 66125
________________________________
From: Jason Palmer <japalmer29 at gmail.com>
Sent: Wednesday, 26 July 2017 4:44:37 AM
To: Tyler Grummett; 'EEGLABLIST'
Subject: RE: [Eeglablist] Running AMICA in a super-computer

Hi Tyler,

Are you running this on the SCCN cluster or your own cluster? Could you tell me what command you are using to run amica in matlab, and if it is your own cluster setup, what mpirun or mpiexec command is running, whether it is sge or torque and what if any qsub script is being run?

Also the first lines of the stdout generated by Amica where it identifies the processes and their respective nodes …

I wonder if you are actually running multiple (overlapping) copies of the same run on several nodes, as can happen with the mpirun commands in some environment setups, instead of having the processes run in concert and communicating.

Best,
Jason

From: eeglablist [mailto:eeglablist-bounces at sccn.ucsd.edu] On Behalf Of Tyler Grummett
Sent: Monday, July 24, 2017 9:00 AM
To: EEGLABLIST
Subject: [Eeglablist] Running AMICA in a super-computer


Dear eeglab,



Recently I have been facing a problem when running multiple AMICA jobs at once on a supercomputer, where I get the following error:

...
 iter   258 lrate =  1.0000000000 LL =  -0.0238247322 nd =  0.0001777274, D =   0.89129E-01  0.89129E-01  ( 33.34 s, 182.8 h)
 iter   259 lrate =  1.0000000000 LL =  -0.0238213053 nd =  0.0001768634, D =   0.89103E-01  0.89103E-01  ( 34.18 s, 187.4 h)
 iter   260 lrate =  1.0000000000 LL =  -0.0238178712 nd =  0.0001762253, D =   0.89076E-01  0.89076E-01  ( 33.08 s, 181.4 h)
forrtl: Permission denied
forrtl: severe (28): CLOSE error, unit 19, file "Unknown"
Image              PC                Routine            Line        Source
amica15ub          00000000010D4003  Unknown               Unknown  Unknown
amica15ub          00000000010D1A6D  Unknown               Unknown  Unknown
amica15ub          0000000000441780  Unknown               Unknown  Unknown
amica15ub          0000000000419E08  Unknown               Unknown  Unknown
amica15ub          00000000004021DE  Unknown               Unknown  Unknown
amica15ub          000000000118C1A4  Unknown               Unknown  Unknown
amica15ub          00000000004020C1  Unknown               Unknown  Unknown


However, this error doesn't generate an error in MATLAB, so it tries to proceed and uses the ICA weights from the most recent iteration. Obviously this isn't ideal, as AMICA requires a lot of iterations to get good ICs.
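
Because the crash is silent on the MATLAB side, one option is to inspect AMICA's text output after each run and refuse to use the weights if it looks incomplete. The following is a minimal sketch, assuming the output shown above has been captured to a file. The log path, the expected iteration count, and the helper name amica_log_ok are all assumptions for illustration. A run that stops early for a legitimate reason would also be flagged by the iteration check.

```shell
#!/bin/bash
# Sketch: decide whether an AMICA run finished cleanly by inspecting its text log.
# ASSUMPTION: the caller supplies the log file path and the expected number of
# iterations; amica_log_ok is a hypothetical helper, not part of AMICA.

# Usage: amica_log_ok LOGFILE EXPECTED_ITERS
# Returns 0 if the log has no Fortran runtime error and reached EXPECTED_ITERS,
# and 1 otherwise.
amica_log_ok() {
    local logfile="$1" expected="$2" last_iter

    # A crash like the one above leaves a "forrtl: severe" line in the output.
    if grep -q 'forrtl: severe' "$logfile"; then
        echo "runtime error found in $logfile" >&2
        return 1
    fi

    # Progress lines look like " iter   260 lrate = ..."; keep the last number.
    last_iter=$(awk '$1 == "iter" && $2 ~ /^[0-9]+$/ { n = $2 } END { print n + 0 }' "$logfile")

    if [ "$last_iter" -lt "$expected" ]; then
        echo "stopped at iteration $last_iter of $expected" >&2
        return 1
    fi
    return 0
}
```

A job script could call this after MATLAB exits and mark the subject for re-running when it returns non-zero, rather than carrying on with weights from an interrupted run.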



Does anyone have any idea how to fix this or find a workaround? I can confirm that no two jobs use the same temporary file, temporary output folder, or any of the functions that are run in AMICA. They are all copied to their own folder and run independently. At least that's what I hope is happening.



I'm running out of things to try.



Regards,

Tyler

