Not sure about the CUDA limitations; I'll let others speak to that...
But in developing the mne-python CUDA filtering code, IIRC the primary
limitation was (by far) transferring the data to and from the GPU. The FFT
computations themselves were a fraction of the total time. I suspect using
multiple jobs won't help CUDA filtering very much since the jobs would
presumably compete for the same memory bandwidth, but I would love to be
wrong about this. If it works better, it would be great to open an
mne-python issue for it, as we are always looking for speedups :)
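To make the transfer-vs-compute point concrete, here is a minimal overlap-add FFT filtering sketch (roughly the shape of mne-python's CPU fallback path, written from scratch here, not copied from mne). The comments mark where the CUDA path would insert host-to-device and device-to-host copies; each block crosses the PCIe bus twice, which is why the copies can dominate even when the FFTs themselves are fast.

```python
import numpy as np

def fft_filter(x, h, n_fft):
    """Minimal overlap-add FFT filtering (illustrative sketch).

    On a CUDA path, only the rfft/irfft calls below would move to the
    GPU; every block of `x` still has to cross the PCIe bus twice
    (host -> device, device -> host), so transfer time can dominate.
    """
    n_h = len(h)
    n_seg = n_fft - n_h + 1            # samples of x per block
    H = np.fft.rfft(h, n_fft)          # kernel transformed once
    y = np.zeros(len(x) + n_h - 1)
    for start in range(0, len(x), n_seg):
        seg = x[start:start + n_seg]
        # CUDA path: copy seg to GPU, cufft, multiply, inverse, copy back
        Y = np.fft.rfft(seg, n_fft) * H
        out = np.fft.irfft(Y, n_fft)
        stop = min(start + n_fft, len(y))
        y[start:stop] += out[:stop - start]   # overlap-add
    return y
```

Since each zero-padded block satisfies len(seg) + n_h - 1 <= n_fft, the circular convolution per block is actually linear, and the summed result equals full linear convolution with `h`.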
On Nov 1, 2014 7:21 PM, "kjs" <bfb(a)riseup.net> wrote:
I have written an MPI routine in Python that sends jobs to N worker
processes. The root process handles file IO and the workers do
computation. In the worker processes, calls are made to the CUDA-enabled
GPU to do FFTs.
Is it safe to have N processes potentially making calls to the same GPU
at the same time? I have not made any amendments to the CUDA code and
have little knowledge of what could possibly go wrong.
I am using python-mne with CUDA enabled to call scikits.cuda.fft
PyCUDA mailing list
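One conservative way to sidestep the question of concurrent device access is to guard the GPU section with a shared lock so at most one worker touches the device at a time. Below is a hedged sketch of that pattern: threads stand in for the MPI worker processes, and numpy's FFT stands in for scikits.cuda.fft (in real multi-process code, each process would additionally need its own CUDA context, since contexts can't be shared across processes).

```python
import threading
import queue
import numpy as np

# Sketch only: threads stand in for MPI workers, np.fft.rfft stands in
# for the GPU FFT call.  The shared lock serializes the "GPU" section.
gpu_lock = threading.Lock()

def worker(task_q, results):
    while True:
        item = task_q.get()
        if item is None:            # poison pill: no more work
            break
        idx, data = item
        with gpu_lock:              # at most one worker on the device
            results[idx] = np.fft.rfft(data)

def run(arrays, n_workers=2):
    task_q = queue.Queue()
    results = {}
    threads = [threading.Thread(target=worker, args=(task_q, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in enumerate(arrays):
        task_q.put(item)
    for _ in threads:
        task_q.put(None)
    for t in threads:
        t.join()
    return [results[i] for i in range(len(arrays))]
```

The lock trades parallelism for safety; if the device and driver handle concurrent contexts well, it can be dropped, but it makes "what could possibly go wrong" much easier to reason about.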
Thanks Andreas, this is good to know. I noticed that even though PyCUDA
is currently only using one of two GPUs, that GPU is only ever at ~35%
memory and ~22% processing utilization. This could be related to Eric's
observation that the PCIe 16x bus bandwidth reaches capacity while the
GPU is pushing out fast FFT'd arrays, allowing only one or two arrays
on the GPU at the same time.
From what I have seen, using CUDA speeds up my FFTs ~2x, though the
workers do many other computations on the CPU. It's a worst-case
scenario that all N workers are trying to send data to the GPU at the
same time.
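If the bus really is the bottleneck, one mitigation (an assumption on my part, not something from this thread) is to batch many equal-length arrays into a single transfer, so each batch costs one host-to-device and one device-to-host copy instead of one pair per array. A minimal sketch, again with numpy standing in for the GPU FFT:

```python
import numpy as np

def fft_batched(arrays):
    """Transform many equal-length arrays in one batched call.

    On a CUDA path this would mean one contiguous host-to-device copy
    and one device-to-host copy per batch, amortizing the PCIe cost
    across all the arrays.  (numpy stands in for the GPU FFT here.)
    """
    batch = np.vstack(arrays)           # one contiguous block to ship
    return np.fft.rfft(batch, axis=1)   # one batched FFT call
```

Whether this helps in practice depends on how much of the per-array cost is fixed call/transfer overhead versus raw bandwidth, so it is worth benchmarking before restructuring the MPI workers around it.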