[Eeglablist] GPUs and EEGLAB
Arnaud Delorme
arno at ucsd.edu
Tue Oct 5 13:07:27 PDT 2010
This message is from Daniel Gardner concerning the following GPU wiki page:
http://sccn.ucsd.edu/wiki/GPU
Seconding and confirming Arno's recent analysis
of GPU-enabled MATLAB for EEGLAB, I agree that
realizing the considerable potential of GPUs for
neurophysiology requires very careful profiling
and programming that generic packages are
unlikely to achieve.
Jonathan Victor and I have begun a project to
extend our neuroanalysis.org
information-theoretic and other spike-train
analytics to the GPU platform. Our experience,
that of others in our project who have used GPUs
in other areas of biomedicine, and that of our
TC Dublin colleague Conor Houghton all confirm
that generic or library-based solutions show
only modest performance gains.
Getting more than an order-of-magnitude
improvement over standard dual-core processors
requires very careful use of the GPU
architecture, including thread assignment and
memory utilization, and this needs thoughtful
analysis and manual recoding of parallel
routines.
Of course this has to begin with a careful
analysis that identifies bottlenecks, and
specifically bottlenecks that are good code
targets for GPU-based processing. In addition to
being computationally intensive, these should be
divisible into large numbers of threads, larger
than the number of available cores. Each thread
should perform a good number of independent
computations that use on-chip memory as much as
possible. The trick is finding the algorithms
for which the thread, core, and memory
architecture of the GPU actually facilitates the
needed computation. Data fetches and bus
transfers between the CPU-controlled part of the
computer and the GPU slow things down, so the
ideal case for speedup is one where data is
transferred to the GPU once and long
calculations are then done in parallel by
hundreds of GPU cores rather than serially (or
in parallel by a small number of CPU cores).
Also, the very new 448-core GPUs allow efficient
use of many threads per core, up to an
efficiency limit of several thousand
simultaneous threads.
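To make that transfer-once, compute-long pattern
concrete, here is a minimal CUDA sketch; the
kernel name, array names, sizes, and iteration
count are hypothetical, chosen only to show the
structure: a single host-to-device copy, a long
independent calculation in each of roughly a
million threads, and a single copy back.

#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical kernel: each thread runs a long, independent
   calculation on one element, so arithmetic dominates the one-time
   cost of moving the data across the bus. */
__global__ void longComputation(const float *in, float *out,
                                int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        for (int k = 0; k < iters; ++k)
            x = 0.999f * x + sinf(x);
        out[i] = x;
    }
}

int main(void)
{
    const int n = 1 << 20;     /* ~1M elements: far more threads than cores */
    size_t bytes = n * sizeof(float);
    float *hIn  = (float *)malloc(bytes);
    float *hOut = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i / n;

    float *dIn, *dOut;
    cudaMalloc((void **)&dIn,  bytes);
    cudaMalloc((void **)&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);   /* one transfer in  */

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    longComputation<<<blocks, threads>>>(dIn, dOut, n, 1000);

    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost); /* one transfer out */
    printf("out[0] = %f\n", hOut[0]);

    cudaFree(dIn); cudaFree(dOut); free(hIn); free(hOut);
    return 0;
}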
Although we have just begun, we are willing to
share our ongoing experience with the EEGLAB
community over time; we begin with these initial
suggestions. The analysis posted appears to have
been carried out on an NVIDIA C1060 (there is no
2060), but we target these suggestions at the
newer NVIDIA GPU, the C2050, and its native CUDA
development environment:
- Launch kernel processes efficiently to
optimize instruction throughput, including
straightforward execution paths for each warp,
- Design code so that flow-control instructions
(if...else, for, do, ...) control multi-thread
warps rather than individual threads (see the
first sketch after this list),
- Use GPU card shared memory efficiently and
appropriately, enabling block (multi-thread)
multi-parameter-dependent and paired-trace
calculations, as well as keeping values within
registers to avoid GPU card 'local' (on-card but
off-chip, thread-specific) memory transfers
(also illustrated in the first sketch after this
list), and
- Leverage the structured hierarchy of the
several components of the GPU architecture so
that memory transfers, bank utilization, and
thread-per-kernel and thread-per-block execution
are performed in the smallest number of clock
cycles.
- Select the appropriate CUDA runtime math
library (one is optimized for speed, the other
for enhanced precision) for each routine (see
the second sketch after this list).
- Be prepared for multiple cycles of optimization.
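For the flow-control and shared-memory
suggestions above, a minimal sketch (hypothetical
kernel and names) of a per-block sum illustrates
both at once: the reduction loop bound is the
same for every thread in a block, so whole warps
follow the same path, and the partial sums live
in fast on-chip shared memory rather than being
re-read from off-chip memory.

/* Hypothetical per-block reduction, intended to be launched with
   256-thread blocks, e.g. blockSum<<<numBlocks, 256>>>(dIn, dSums, n).
   It leaves one partial sum per block, to be combined on the host or
   in a second kernel. */
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];            /* on-chip shared memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    /* Tree reduction: active threads stay contiguous, so divergence
       within a 32-thread warp is limited to the last few steps. */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];
}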
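For the math-library suggestion, the trade-off
appears at the level of individual calls: the
intrinsic __sinf() is faster but less precise
than sinf(), and compiling with nvcc
--use_fast_math applies the fast variants
throughout. A pair of hypothetical device
helpers makes the choice explicit:

/* Hypothetical helpers contrasting the two flavors. */
__device__ float fastPhase(float x)  { return __sinf(x); } /* fast, reduced precision  */
__device__ float exactPhase(float x) { return sinf(x);   } /* slower, higher precision */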
--
...Daniel Gardner
________________________________________________
| dan at med.cornell.edu
| dg458 at columbia.edu
|________________________________________________
| Dr. Daniel Gardner
| Professor of Physiology & Biophysics
| Head, Laboratory of Neuroinformatics - D-404
| Weill Medical College of Cornell University
| 1300 York Avenue          voice: (212) 746-6373
| New York, NY 10065 USA    fax:   (212) 746-8355
| US cell: +1 917 902-0654
| UK mobile: +44 (0) 7817 423 348
|________________________________________________