[Eeglablist] GPUs and EEGLAB
Arnaud Delorme
arno at ucsd.edu
Tue Oct 5 13:07:27 PDT 2010
This message is from Daniel Gardner concerning the following GPU wiki page:
http://sccn.ucsd.edu/wiki/GPU
Seconding and confirming Arno's recent analysis
of GPU-enabled MATLAB for EEGLAB, I agree that
realizing the considerable potential of GPUs for
neurophysiology requires very careful profiling
and programming that generic packages are
unlikely to achieve.
Jonathan Victor and I have begun a project to
extend our neuroanalysis.org
information-theoretic and other spike-train
analytics to the GPU platform. Our experience,
that of others in our project who have used GPUs
in other areas of biomedicine, and that of our
TC Dublin colleague Conor Houghton all confirm
that generic or library-based solutions show
only modest performance gains.
Getting more than an order-of-magnitude
improvement over standard dual-core processors
requires very careful use of the GPU
architecture, including thread assignment and
memory utilization, and this needs thoughtful
analysis and manual recoding of parallel
routines.
Of course this has to begin with a careful
analysis that identifies bottlenecks, and
specifically bottlenecks that are good code
targets for GPU-based processing. In addition to
being computationally intensive, these should be
divisible into large numbers of threads, larger
than the number of available cores. Each thread
should perform a good number of independent
computations that use on-chip memory as much as
possible. The trick is finding the algorithms
for which the thread, core, and memory
architecture of the GPU actually facilitates the
needed computation. Data fetches and bus
transfers between the CPU-controlled part of the
computer and the GPU slow things down, so the
ideal case for speedup is one where data is
transferred to the GPU once and long
calculations are then done in parallel by
hundreds of GPU cores rather than serially (or
in parallel by a small number of CPU cores).
Also, the very new 448-core GPUs allow efficient
use of many threads per core, up to an
efficiency limit of several thousand
simultaneous threads.
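To make that transfer-once, compute-long pattern
concrete, here is a minimal CUDA sketch; the
kernel name, array names, sizes, and iteration
count are hypothetical, chosen only to show the
structure: a single host-to-device copy, a long
independent calculation in each of roughly a
million threads, and a single copy back.

#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical kernel: each thread runs a long, independent
   calculation on one element, so arithmetic dominates the one-time
   cost of moving the data across the bus. */
__global__ void longComputation(const float *in, float *out,
                                int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        for (int k = 0; k < iters; ++k)
            x = 0.999f * x + sinf(x);
        out[i] = x;
    }
}

int main(void)
{
    const int n = 1 << 20;     /* ~1M elements: far more threads than cores */
    size_t bytes = n * sizeof(float);
    float *hIn  = (float *)malloc(bytes);
    float *hOut = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i / n;

    float *dIn, *dOut;
    cudaMalloc((void **)&dIn,  bytes);
    cudaMalloc((void **)&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);   /* one transfer in  */

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    longComputation<<<blocks, threads>>>(dIn, dOut, n, 1000);

    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost); /* one transfer out */
    printf("out[0] = %f\n", hOut[0]);

    cudaFree(dIn); cudaFree(dOut); free(hIn); free(hOut);
    return 0;
}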
Although we have just begun, we are willing to
share our ongoing experience with the EEGLAB
community over time; we begin with these initial
suggestions. The analysis posted appears to have
been carried out on an NVIDIA C1060 (there is no
2060), but we target these suggestions at the
newer NVIDIA GPU, the C2050, and its native CUDA
development environment:
- Launch kernel processes efficiently to
optimize instruction throughput, including
straightforward execution paths for each warp,
- Design code so that flow-control instructions
(if...else, for, do, ...) control multi-thread
warps rather than individual threads (see the
first sketch after this list),
- Use GPU card shared memory efficiently and
appropriately, enabling block (multi-thread)
multi-parameter-dependent and paired-trace
calculations, as well as keeping values within
registers to avoid GPU card 'local' (on-card but
off-chip, thread-specific) memory transfers
(also illustrated in the first sketch after this
list), and
- Leverage the structured hierarchy of the
several components of the GPU architecture so
that memory transfers, bank utilization, and
thread-per-kernel and thread-per-block execution
are performed in the smallest number of clock
cycles.
- Select the appropriate CUDA runtime math
library (one is optimized for speed, the other
for enhanced precision) for each routine (see
the second sketch after this list).
- Be prepared for multiple cycles of optimization.
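For the flow-control and shared-memory
suggestions above, a minimal sketch (hypothetical
kernel and names) of a per-block sum illustrates
both at once: the reduction loop bound is the
same for every thread in a block, so whole warps
follow the same path, and the partial sums live
in fast on-chip shared memory rather than being
re-read from off-chip memory.

/* Hypothetical per-block reduction, intended to be launched with
   256-thread blocks, e.g. blockSum<<<numBlocks, 256>>>(dIn, dSums, n).
   It leaves one partial sum per block, to be combined on the host or
   in a second kernel. */
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];            /* on-chip shared memory */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    /* Tree reduction: active threads stay contiguous, so divergence
       within a 32-thread warp is limited to the last few steps. */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];
}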
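For the math-library suggestion, the trade-off
appears at the level of individual calls: the
intrinsic __sinf() is faster but less precise
than sinf(), and compiling with nvcc
--use_fast_math applies the fast variants
throughout. A pair of hypothetical device
helpers makes the choice explicit:

/* Hypothetical helpers contrasting the two flavors. */
__device__ float fastPhase(float x)  { return __sinf(x); } /* fast, reduced precision  */
__device__ float exactPhase(float x) { return sinf(x);   } /* slower, higher precision */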
--
...Daniel Gardner
________________________________________________
| dan at med.cornell.edu
| dg458 at columbia.edu
|________________________________________________
| Dr. Daniel Gardner
| Professor of Physiology & Biophysics
| Head, Laboratory of Neuroinformatics - D-404
| Weill Medical College of Cornell University
| 1300 York Avenue          voice: (212) 746-6373
| New York, NY 10065 USA    fax:   (212) 746-8355
| US cell: +1 917 902-0654
| UK mobile: +44 (0) 7817 423 348
|________________________________________________