[Eeglablist] Running AMICA in a super-computer
Tyler Grummett
tyler.grummett at flinders.edu.au
Wed Jul 26 00:07:17 PDT 2017
Dear Jason,
Thank you for taking the time to reply to my email.
I'm running the jobs on my university's supercomputer. To my knowledge we don't do cluster computing; instead, each computer has a high number of processors. Each job requests a certain number of threads (or slots) and runs on a single computer with the number of processors requested.
The first few lines in the AMICA output are:
1 processor name = c106.csem.flinders.edu.au
1 host_num = 190898383
This is MPI process 1 of 1 ; I am process 1 of
1 on node: c106.csem.flinders.edu.au
1 : node root process 1 of 1
The qsub command that I run is:
qsub -t 1:93:1 -m e -M grum0003 at flinders.edu.au -S /bin/bash -cwd -q all.q -l matlab=1 -pe threaded 16 Scripts/AMICA_script.txt 'Control' 'phd' 'false' 'false'
The matlab line of code that is run is:
[ EEG_merged.icaweights, EEG_merged.icasphere, ~] = runamica15( EEG_merged.datfile, ...
'pcakeep', PCA_comps, 'num_chans', EEG_merged.nbchan, 'max_threads', Nthreads, ...
'max_iter', max_iter, 'outdir', amicaout_folderpath);
The 'amica_win' folder is copied to the subject's directory, added to the path, and then deleted after the function is finished to ensure that no other process is using the same function. Currently I am running 45 subjects on our supercomputer using 16 threads each, and only one of them has crashed. When I was using 8 threads on each job, I was able to run 90 subjects at once, which resulted in a larger number of crashes. So it does appear as though the more subjects (jobs) I run, the more it crashes, even though each job is using its own runamica15 and outputting to a folder which is unique to it.
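One way to make the per-job isolation described above explicit is to derive the working directory from the scheduler's own job and task identifiers. This is only a sketch: it assumes the scheduler is Grid Engine (which sets JOB_ID and SGE_TASK_ID for array tasks), and the function name task_workdir and the directory layout are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print a working-directory path unique to one array task.
# Assumes Grid Engine, which exports JOB_ID and SGE_TASK_ID to each task.
# Usage: task_workdir BASE_DIR
task_workdir() {
    local base="$1"
    # Fall back to placeholder values when run outside the scheduler.
    echo "${base}/job_${JOB_ID:-nojob}_task_${SGE_TASK_ID:-0}"
}
```

In a job script this could be used as `workdir=$(task_workdir "$PWD") && mkdir -p "$workdir"`, so that two tasks cannot end up with the same folder name.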
Hopefully I have provided enough information. Let me know if I haven't.
Regards,
Tyler
*************************
Tyler Grummett ( BBSc, BSc(Hons I))
PhD Candidate
Brain Signals Laboratory
Flinders University
Rm 5A301
Ext 66125
________________________________
From: Jason Palmer <japalmer29 at gmail.com>
Sent: Wednesday, 26 July 2017 4:44:37 AM
To: Tyler Grummett; 'EEGLABLIST'
Subject: RE: [Eeglablist] Running AMICA in a super-computer
Hi Tyler,
Are you running this on the SCCN cluster or your own cluster? Could you tell me what command you are using to run amica in matlab, and if it is your own cluster setup, what mpirun or mpiexec command is running, whether it is sge or torque and what if any qsub script is being run?
Also, the first lines of the stdout generated by AMICA, where it identifies the processes and their respective nodes, would help.
I wonder if you are actually running multiple (overlapping) copies of the same run on several nodes, as can happen with the mpirun commands in some environment setups, instead of having the processes run in concert and communicating.
Best,
Jason
From: eeglablist [mailto:eeglablist-bounces at sccn.ucsd.edu] On Behalf Of Tyler Grummett
Sent: Monday, July 24, 2017 9:00 AM
To: EEGLABLIST
Subject: [Eeglablist] Running AMICA in a super-computer
Dear eeglab,
Recently I've been facing a problem when running multiple AMICA jobs at once on a supercomputer, where I get the following error:
...
iter 258 lrate = 1.0000000000 LL = -0.0238247322 nd = 0.0001777274, D = 0.89129E-01 0.89129E-01 ( 33.34 s, 182.8 h)
iter 259 lrate = 1.0000000000 LL = -0.0238213053 nd = 0.0001768634, D = 0.89103E-01 0.89103E-01 ( 34.18 s, 187.4 h)
iter 260 lrate = 1.0000000000 LL = -0.0238178712 nd = 0.0001762253, D = 0.89076E-01 0.89076E-01 ( 33.08 s, 181.4 h)
forrtl: Permission denied
forrtl: severe (28): CLOSE error, unit 19, file "Unknown"
Image PC Routine Line Source
amica15ub 00000000010D4003 Unknown Unknown Unknown
amica15ub 00000000010D1A6D Unknown Unknown Unknown
amica15ub 0000000000441780 Unknown Unknown Unknown
amica15ub 0000000000419E08 Unknown Unknown Unknown
amica15ub 00000000004021DE Unknown Unknown Unknown
amica15ub 000000000118C1A4 Unknown Unknown Unknown
amica15ub 00000000004020C1 Unknown Unknown Unknown
However, this error doesn't generate an error in MATLAB, so it tries to proceed and uses the ICA weights from the most recent iteration. Obviously this isn't ideal, as AMICA requires a lot of iterations to get good ICs.
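Because the Fortran runtime failure does not surface as a MATLAB error, one option is to inspect the captured AMICA output before trusting the weights. The following is a minimal shell sketch, not part of AMICA or EEGLAB: the function name amica_log_ok is hypothetical, and it assumes the AMICA output shown above has been saved to a text file whose path is passed in.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if a saved AMICA log shows no Fortran
# runtime error and reached at least the expected number of iterations.
# Usage: amica_log_ok LOGFILE EXPECTED_ITERS
amica_log_ok() {
    local log="$1" expected="$2" last

    # A missing or unreadable log is treated as a failure.
    [ -r "$log" ] || return 1

    # Any "forrtl:" line means the Fortran runtime reported a problem.
    if grep -q 'forrtl:' "$log"; then
        echo "runtime error reported in $log" >&2
        return 1
    fi

    # Take the iteration number from the last line whose first field is "iter".
    last=$(awk '$1 == "iter" { n = $2 } END { print n + 0 }' "$log")
    if [ "$last" -lt "$expected" ]; then
        echo "stopped at iteration $last of $expected" >&2
        return 1
    fi
    return 0
}
```

A job script could call `amica_log_ok path/to/amica_output.txt 2000 || exit 1` so that the scheduler marks the task as failed. The iteration check is deliberately conservative: a run that ends early for a legitimate reason, such as meeting a convergence criterion, would also be flagged and would need a manual look.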
Does anyone have any idea how to fix this or find a workaround? I can confirm that there are no jobs using the same temporary file, temporary output folder, or any of the functions that are run in AMICA. They are all copied to their own folder and run independently. At least that's what I hope is happening.
I'm running out of things to try.
Regards,
Tyler
*************************
Tyler Grummett ( BBSc, BSc(Hons I))
PhD Candidate
Brain Signals Laboratory
Flinders University
Rm 5A301
Ext 66125