I am trying to implement in Python the following pattern for **multi-CPU and
single-GPU** computation, using the **pycuda** and **pyfft** packages.
I would like to have **several processes** (e.g. launched with
multiprocessing.Pool()), with **each of them** able to perform **FFTs on the
GPU (using NVIDIA CUDA)**.
However, I have the following problem: if I run too many processes or too
many FFTs per process, **the overall script hangs without terminating** (and
without computing all the FFTs that are due). From further investigation I
suspect this is due to the **memory limit** on the GPU (currently 2048 MB on
an NVIDIA GeForce GT 750M). It seems that the multiprocessing pool is not
able to regain control once GPU memory runs out.
Is there any way to avoid this?
Since each process requires less than 2048 MB, I would like to have
something like a **queue** where each process can *book* the usage of the
GPU and, when a process releases its context, the next process in the queue
starts using it.
Is this doable?
Alternatively, is it possible to force only one process to use the GPU at a
given time?
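To make the idea concrete, here is a rough, untested sketch of what I mean,
serializing GPU access with a multiprocessing.Lock shared among the pool
workers (gpu_lock, init_worker and do_fft are placeholder names; the FFT
work itself is elided):

import multiprocessing

def init_worker(lock):
    # make the shared lock visible inside each pool worker
    global gpu_lock
    gpu_lock = lock

def do_fft(par):
    # block until the GPU is free, then hold it for the whole
    # create-context / compute / pop-context critical section
    with gpu_lock:
        import pycuda.driver as cuda
        from pycuda.tools import make_default_context
        cuda.init()
        context = make_default_context()
        try:
            pass  # ... create the Plan, execute the FFT, fetch the result ...
        finally:
            context.pop()
    return par

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=4,
                                initializer=init_worker,
                                initargs=(lock,))
    print pool.map(do_fft, range(16))

Since the lock is held across context creation and context.pop(), at most
one worker owns a CUDA context at any moment, so GPU memory use stays
bounded by a single process's footprint.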
I have tried these solutions separately, but they do not work (or perhaps I
have not implemented them correctly):
1. synchronizing the stream, with proc_stream.synchronize()
2. clearing the context caches, with pycuda.tools.clear_context_caches()
3. changing the compute mode, with cuda.compute_mode =
cuda.compute_mode.EXCLUSIVE
**Note:** Solution 2 seems to free some memory, but it makes the computation
much slower and does not solve the problem: e.g. when increasing the number
of FFTs to be computed, the script shows the same behaviour.
Here is the code. To start from a simple task, each process here computes
one FFT (one can then use the batch option of execute() to do more FFTs in a
row; see the sketch after the wrapper class below).
import multiprocessing
import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.tools import make_default_context
from pyfft.cuda import Plan

def main():
    # generate a simple matrix (e.g. an image with a signal at the center)
    size = 4096
    center = size/2
    in_matrix = np.zeros((size, size), dtype='complex64')
    in_matrix[center:center+2, center:center+2] = 10.

    pool_size = 4  # integer up to multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    func = FuncWrapper(in_matrix, size)
    nffts = 16  # total number of FFTs to be computed
    par = np.arange(nffts)

    results = pool.map(func, par)
    pool.close()
    pool.join()
    print results

if __name__ == '__main__':
    main()
And here is the function wrapper:
class FuncWrapper(object):
    def __init__(self, matrix, size):
        self.in_matrix = matrix
        self.size = size
        print("Func initialized with matrix size=%i" % size)

    def __call__(self, par):
        proc_id = multiprocessing.current_process().name

        # take control of the GPU
        cuda.init()
        context = make_default_context()
        device = context.get_device()
        proc_stream = cuda.Stream()

        # move data to the GPU; the multiplication self.in_matrix*par is
        # just to have each process computing different matrices
        in_map_gpu = gpuarray.to_gpu(self.in_matrix*par)

        # create the Plan, execute the FFT and get the result back from the GPU
        plan = Plan((self.size, self.size), dtype=np.complex64,
                    fast_math=False, normalize=False,
                    wait_for_finish=True,
                    stream=proc_stream)
        plan.execute(in_map_gpu, wait_for_finish=True)
        result = in_map_gpu.get()

        # free memory on the GPU
        del in_map_gpu
        mem = np.array(cuda.mem_get_info())/1.e6
        print("%s free=%f\ttot=%f" % (proc_id, mem[0], mem[1]))

        # release the context
        context.pop()
        return par
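For reference, a sketch of the batch option mentioned above. The batch
keyword of execute() comes from pyfft; the layout assumption (the nffts
input matrices stacked contiguously along the first axis) is mine and should
be checked against the pyfft docs:

import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.tools import make_default_context
from pyfft.cuda import Plan

cuda.init()
context = make_default_context()

size = 1024
nffts = 4
base = np.zeros((size, size), dtype='complex64')
base[size/2:size/2+2, size/2:size/2+2] = 10.

# stack nffts inputs one after another in memory (assumed batch layout)
stack = np.array([base * (k + 1) for k in range(nffts)], dtype='complex64')
stack_gpu = gpuarray.to_gpu(stack)

# the Plan picks up the current context when none is passed explicitly
plan = Plan((size, size), dtype=np.complex64, normalize=False)
plan.execute(stack_gpu, batch=nffts, wait_for_finish=True)
results = stack_gpu.get()  # nffts transformed matrices from a single call

context.pop()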
Now, with nffts=16 and pool_size=4 the script terminates correctly and gives
this output:
Func initialized with matrix size=4096
PoolWorker-1 free=1481.019392 tot=2147.024896
PoolWorker-2 free=1331.011584 tot=2147.024896
PoolWorker-3 free=1181.003776 tot=2147.024896
PoolWorker-4 free=1030.631424 tot=2147.024896
PoolWorker-1 free=881.074176 tot=2147.024896
PoolWorker-2 free=731.746304 tot=2147.024896
PoolWorker-3 free=582.418432 tot=2147.024896
PoolWorker-4 free=433.090560 tot=2147.024896
PoolWorker-1 free=582.754304 tot=2147.024896
PoolWorker-2 free=718.946304 tot=2147.024896
PoolWorker-3 free=881.254400 tot=2147.024896
PoolWorker-4 free=1030.684672 tot=2147.024896
PoolWorker-1 free=868.028416 tot=2147.024896
PoolWorker-2 free=731.713536 tot=2147.024896
PoolWorker-3 free=582.402048 tot=2147.024896
PoolWorker-4 free=433.090560 tot=2147.024896
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
But with nffts=18 and pool_size=4 the script does not terminate and gives
this output, remaining stuck at the last line:
Func initialized with matrix size=4096
PoolWorker-1 free=1416.392704 tot=2147.024896
PoolWorker-2 free=982.544384 tot=2147.024896
PoolWorker-1 free=1101.037568 tot=2147.024896
PoolWorker-2 free=682.991616 tot=2147.024896
PoolWorker-3 free=815.747072 tot=2147.024896
PoolWorker-4 free=396.918784 tot=2147.024896
PoolWorker-3 free=503.046144 tot=2147.024896
PoolWorker-4 free=397.144064 tot=2147.024896
PoolWorker-1 free=531.361792 tot=2147.024896
PoolWorker-1 free=397.246464 tot=2147.024896
PoolWorker-2 free=518.610944 tot=2147.024896
PoolWorker-2 free=397.021184 tot=2147.024896
PoolWorker-3 free=517.193728 tot=2147.024896
PoolWorker-4 free=397.021184 tot=2147.024896
PoolWorker-3 free=504.336384 tot=2147.024896
PoolWorker-4 free=149.123072 tot=2147.024896
PoolWorker-1 free=283.340800 tot=2147.024896
...on hold...
Many thanks for your help!
"David A. Markowitz" <david.a.markowitz(a)gmail.com> writes:
> Many thanks Andreas, I've solved the problem now. While digging through the
> compiler.py code, I noticed a check for the PYCUDA_DEFAULT_NVCC_FLAGS
> environment variable, which is then passed to nvcc. Ultimately I was able
> to solve my problem by putting a file in /etc/profile.d/ with the contents:
>
> export PYCUDA_DEFAULT_NVCC_FLAGS="--dont-use-profile
> -ldir=/usr/local/cuda/nvvm/libdevice"
>
> This offers a simple but effective way for PyCUDA to point nvcc in the
> right direction.
>
> I'm still not sure why I wasn't able to fix the "libdevice library not
> found" error by modifying nvcc.profile directly, but the above is probably
> a better solution regardless, because it doesn't require modifications to a
> file that affects nvcc's operation in all use cases (i.e. even when it's
> not called by PyCUDA). Since I've only encountered this problem using
> PyCUDA, it makes sense that the solution should only kick in when nvcc is
> called by PyCUDA.
>
> I'm really looking forward to using this software package. Thank you for
> all of your hard work putting it together!
Glad to hear you got things to work!
Andreas
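For scripts that cannot rely on /etc/profile.d, here is a minimal sketch of
the same fix applied from within Python; the assumption is that PyCUDA reads
PYCUDA_DEFAULT_NVCC_FLAGS from the environment when pycuda.compiler is
imported, so the variable must be set first:

import os

# must happen before pycuda.compiler is imported (assumption: the flags
# are read from the environment at import time)
os.environ["PYCUDA_DEFAULT_NVCC_FLAGS"] = \
    "--dont-use-profile -ldir=/usr/local/cuda/nvvm/libdevice"

import pycuda.autoinit
from pycuda.compiler import SourceModule

# any kernel compiled from here on is built with the flags above
mod = SourceModule("""
__global__ void noop() { }
""")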
"David A. Markowitz" <david.a.markowitz(a)gmail.com> writes:
> Thanks again, Andreas. I'm really looking forward to getting started with
> PyCUDA.
>
> Unfortunately, I've already tried your suggested approach (updating
> nvcc.profile with NVVMIR_LIBRARY_DIR = /usr/local/cuda-6.5/nvvm/libdevice,
> which contains libdevice.compute_35.10.bc in my installation), but it did
> not solve the problem. When I asked NVIDIA about this directly (hoping not
> to bother you further), they told me I should never need to modify
> nvcc.profile under any circumstances, which wasn't very helpful.
>
> Is there some other line in nvcc.profile that I can modify so that nvcc
> will be able to find the appropriate libdevice library when called by
> PyCUDA's compiler script? e.g. perhaps I could append something to the
> INCLUDES or LIBRARIES variables?
>
> Is there an easy way for me to see which environment variables are
> available when PyCUDA's compiler.py code calls nvcc? This might help
> diagnose the problem.
The call happens here:
https://github.com/inducer/pycuda/blob/master/pycuda/compiler.py#L114
using this:
https://github.com/inducer/pytools/blob/master/pytools/prefork.py#L34
https://docs.python.org/2.7/library/subprocess.html
says that processes should inherit the parent's environment unless
otherwise specified (and it isn't), so it's not clear to me what would
override your variable... You can also use the 'keep=True' flag and try
and run nvcc yourself in the temp directory that PyCUDA creates. That's
perhaps the best way of figuring out what's up.
Andreas
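A minimal sketch of the keep=True approach Andreas describes, using
pycuda.compiler.SourceModule (keep=True preserves the temporary build
directory and reports where it is, so the nvcc invocation can be re-run
there by hand):

import pycuda.autoinit
from pycuda.compiler import SourceModule

# keep=True stops PyCUDA from deleting its temporary build directory,
# so you can cd there and rerun nvcc yourself to inspect the failure
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y * 4;
    a[idx] *= 2;
}
""", keep=True)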
"David A. Markowitz" <david.a.markowitz(a)gmail.com> writes:
> Hi, thanks for the quick reply (and good advice!). I wiped my cuda 6.5
> installation and reinstalled from scratch. nvcc now works when called from
> the command line on simple CUDA samples. It compiles for my GPU's
> architecture (3.5) by default, so PATH and LD_LIBRARY_PATH are definitely
> configured correctly.
>
> I also wiped PyCUDA and reinstalled from scratch, per the instructions for
> Ubuntu 14.04 64 bit on the PyCUDA Installation page. No errors during this
> process, and I can successfully import pycuda.autoinit. However, now when I
> try to run any of the PyCUDA examples, I get the following error:
>
> nvcc fatal : Path to libdevice library not specified
>
> Since I do not encounter this error when compiling CUDA samples or my own
> CUDA code with nvcc, my guess is that my environment variables aren't being
> seen by PyCUDA. I googled this error and found a few threads on the
> subject, but no effective solutions.
>
> I was wondering if I could trouble this list for a pointer or two?
> Hopefully there's a quick fix.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=725649
suggests that you may be able to tweak /etc/nvcc.profile.
HTH,
Andreas
Hi, thanks for the quick reply (and good advice!). I wiped my cuda 6.5
installation and reinstalled from scratch. nvcc now works when called from
the command line on simple CUDA samples. It compiles for my GPU's
architecture (3.5) by default, so PATH and LD_LIBRARY_PATH are definitely
configured correctly.
I also wiped PyCUDA and reinstalled from scratch, per the instructions for
Ubuntu 14.04 64 bit on the PyCUDA Installation page. No errors during this
process, and I can successfully import pycuda.autoinit. However, now when I
try to run any of the PyCUDA examples, I get the following error:
nvcc fatal : Path to libdevice library not specified
Since I do not encounter this error when compiling CUDA samples or my own
CUDA code with nvcc, my guess is that my environment variables aren't being
seen by PyCUDA. I googled this error and found a few threads on the
subject, but no effective solutions.
I was wondering if I could trouble this list for a pointer or two?
Hopefully there's a quick fix.
Many thanks,
-David
On Sat, Jan 17, 2015 at 2:41 PM, Andreas Kloeckner <lists(a)informa.tiker.net>
wrote:
> "David A. Markowitz" <david.a.markowitz(a)gmail.com> writes:
>
> > Hi, I just installed PyCUDA, but test_driver.py crashes with the
> > following error:
> >
> > CompileError: nvcc compilation of /tmp/tmpNht4bp/kernel.cu failed
> > [command: nvcc --cubin -arch sm_35
> >
> -I/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/cuda
> > kernel.cu]
> > [stderr: error in open:
> > /usr/bin/../nvvm/libdevice/libdevice.compute_35.10.bc No such
> > file or directory ]
>
>
> This looks like nvcc is unable to find its own parts. Does nvcc work
> when called from the command line on a simple CUDA sample? (My guess is
> no.)
>
> Andreas
>
"David A. Markowitz" <david.a.markowitz(a)gmail.com> writes:
> Hi, I just installed PyCUDA, but test_driver.py crashes with the
> following error:
>
> CompileError: nvcc compilation of /tmp/tmpNht4bp/kernel.cu failed
> [command: nvcc --cubin -arch sm_35
> -I/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/cuda
> kernel.cu]
> [stderr: error in open:
> /usr/bin/../nvvm/libdevice/libdevice.compute_35.10.bc No such
> file or directory ]
This looks like nvcc is unable to find its own parts. Does nvcc work
when called from the command line on a simple CUDA sample? (My guess is
no.)
Andreas
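A quick sanity check that can be run from the Python side; a child process
spawned here inherits the same environment that PyCUDA's nvcc invocation
will see (which and nvcc must be on PATH for this to succeed):

import subprocess

# if either of these fails, PyCUDA's compile step will fail the same way
print subprocess.check_output(["which", "nvcc"])
print subprocess.check_output(["nvcc", "--version"])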
Received from Shivang Ghetia on Fri, Jan 16, 2015 at 01:23:18PM EST:
> Hi PyCUDA community,
>
> I want to know whether PyCUDA supports the NPP library, as I want to use
> some functions from it. I went through the documentation of CULA;
> scikits.cuda provides a way to access libraries like cuFFT and cuBLAS,
> but I am not able to find any way to access the NPP library.
>
> Regards,
> Shivang.
scikits.cuda (which is developed independently of PyCUDA) doesn't
provide wrappers for the NPP functions yet, although I would be happy to
integrate support for them if someone writes wrappers for the functions.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
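Until such wrappers exist, one stopgap is to call NPP directly through
ctypes. A minimal sketch querying the library version via nppGetLibVersion()
(the shared-library name libnppc.so is an assumption; recent CUDA releases
split NPP into libnppc/libnppi/libnpps):

import ctypes

class NppLibraryVersion(ctypes.Structure):
    # field layout per the NPP headers: major, minor, build
    _fields_ = [("major", ctypes.c_int),
                ("minor", ctypes.c_int),
                ("build", ctypes.c_int)]

# assumption: the NPP core library is named libnppc.so on this system
_nppc = ctypes.cdll.LoadLibrary("libnppc.so")
_nppc.nppGetLibVersion.restype = ctypes.POINTER(NppLibraryVersion)

ver = _nppc.nppGetLibVersion().contents
print "NPP version %d.%d.%d" % (ver.major, ver.minor, ver.build)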
Hi PyCUDA community,
I want to know whether PyCUDA supports the NPP library, as I want to use
some functions from it. I went through the documentation of CULA;
scikits.cuda provides a way to access libraries like cuFFT and cuBLAS, but
I am not able to find any way to access the NPP library.
Regards,
Shivang.
Hi, I just installed PyCUDA, but test_driver.py crashes with the
following error:
CompileError: nvcc compilation of /tmp/tmpNht4bp/kernel.cu failed
[command: nvcc --cubin -arch sm_35
-I/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/cuda
kernel.cu]
[stderr: error in open:
/usr/bin/../nvvm/libdevice/libdevice.compute_35.10.bc No such
file or directory ]
Note the last line: it seems to think my CUDA_ROOT is "/usr/bin/..",
which is clearly incorrect!
All of my CUDA environment variables are set properly (e.g. I'm able
to compile+run the CUDA samples and my own CUDA code without
problems), so my guess is that PyCUDA ignored one or more of these
variables during installation.
I've tried forcing PyCUDA to use the correct CUDA environment
variables during installation, as follows:
sudo ./configure.py --cuda-root=/usr/local/cuda
--cudadrv-lib-dir=/usr/local/cuda/lib64
However, I still encounter the same error as above.
I would be grateful for any tips for how to address this problem.
Happy to provide more information if that would help.
Thanks,
-David