Davide Bassano <bassano.davide(a)gmail.com> writes:
> Dear Mr/Mrs,
>
> I have just started working with PyCuda and I have a simple question: how
> can I parallelize Python code if PyCuda wants a kernel written in C?
>
> Let me clarify: I have Python code (with classes and other constructs that
> suit Python but not C). I have 256 independent for loops that I want to
> parallelize. These loops contain Python code that can’t be translated to
> C. So I tried the PyCuda package, but it turned out that the kernel must
> be written in C.
>
> How can I parallelize actual Python code with the PyCuda package without
> translating my code to C?
You could try using the gpuarray functionality built into PyCUDA. Some
numpy-based codes can be effectively made GPU-aware through it.
Andreas
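For illustration, a minimal sketch of the gpuarray route Andreas mentions (assuming pycuda.autoinit to create a context; the array size is arbitrary):

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray

# numpy-style elementwise expressions execute on the GPU without writing C:
a = gpuarray.to_gpu(np.random.rand(1024))
b = gpuarray.to_gpu(np.random.rand(1024))
c = (2 * a + b) / 3   # computed on the GPU
result = c.get()      # copied back to a numpy array

Note that this only helps when the loop bodies reduce to numpy-style array expressions; loops over arbitrary Python objects cannot be offloaded this way.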
黄 瓒 <dreaming_hz(a)hotmail.com> writes:
> Hi All,
>
> @inducer<https://github.com/inducer> THANK YOU for providing PyCUDA.
>
> Since cudaMalloc can be time-consuming, and it seems that even slicing would trigger such an operation in PyCUDA, are there any tricks to avoid frequent GPU memory operations in PyCUDA?
Slicing a GPUArray involves no allocations. PyCUDA includes a memory
pool which can help avoid redundant allocation.
Andreas
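A minimal sketch of the memory-pool approach (DeviceMemoryPool is in pycuda.tools; the sizes and iteration count here are arbitrary):

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
from pycuda.tools import DeviceMemoryPool

pool = DeviceMemoryPool()

# Route allocations through the pool: freed blocks are held and reused,
# so repeated transfers avoid a fresh cudaMalloc on every iteration.
for _ in range(100):
    a = gpuarray.to_gpu(np.random.rand(256), allocator=pool.allocate)
    view = a[16:32]  # a view into a's memory: no allocation, as noted above

pool.free_held()  # return held blocks to the driver when done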
Noah Young <npyoung(a)stanford.edu> writes:
> I'm trying to run jobs on several GPUs at the same time using multiple
> threads, each with its own context. Sometimes this works flawlessly, but
> ~75% of the time I get a cuModuleLoadDataEx error telling me the context
> has been destroyed. What's frustrating is that nothing changes between
> failed and successful runs of the code. From what I can tell it's down to
> luck whether or not the error comes up:
"Context destroyed" is akin to a segmentation fault on the CPU. You
should find evidence that your code performed an illegal access, e.g.,
using 'dmesg' in the kernel log. (If you see a message "NVRM Xid ...",
that points to the problem) My first suspicion would be a bug in your
code.
Andreas
I'm trying to run jobs on several GPUs at the same time using multiple
threads, each with its own context. Sometimes this works flawlessly, but
~75% of the time I get a cuModuleLoadDataEx error telling me the context
has been destroyed. What's frustrating is that nothing changes between
failed and successful runs of the code. From what I can tell it's down to
luck whether or not the error comes up:
~/anaconda3/lib/python3.6/site-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
    292
    293         from pycuda.driver import module_from_buffer
--> 294         self.module = module_from_buffer(cubin)
    295
    296         self._bind_module()

LogicError: cuModuleLoadDataEx failed: context is destroyed
I start by making the contexts:
from pycuda import driver as cuda

cuda.init()
contexts = []
for i in range(cuda.Device.count()):
    c = cuda.Device(i).make_context()
    c.pop()
    contexts.append(c)
... and setting up a function to use each context, i.e.
import numpy as np
from pycuda import gpuarray

def do_work(ctx):
    with Acquire(ctx):
        a = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        b = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        for _ in range(10):
            c = (a + b) / 2
        out = c.get()
    return out
where `Acquire` is a context manager that handles pushing and popping:
class Acquire:
    def __init__(self, context):
        self.ctx = context

    def __enter__(self):
        self.ctx.push()
        return self.ctx

    def __exit__(self, type, value, traceback):
        self.ctx.pop()
and here I run the code in parallel using a pool of threaded workers via
joblib
from joblib import Parallel, delayed

pool = Parallel(n_jobs=len(contexts), verbose=8, prefer='threads')
with pool:
    # Pass 1
    sum(pool(delayed(do_work)(ctx) for ctx in contexts))
    # Pass 2
    sum(pool(delayed(do_work)(ctx) for ctx in contexts))
Note that I do several "passes" of work (I'll need to do 50 or so in my
real application) with the same thread pool. It seems like the crash always
happens somewhere in the second pass, or not at all. Any ideas about how to
keep my contexts from getting destroyed?
*System info*
Ubuntu 16.04 (Amazon Deep Learning AMI)
CUDA driver version 396.44
4x V100 GPUs
Python 3.6
pycuda version 2018.1.1
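One pattern often suggested for this kind of setup (not from the thread above, and with an arbitrary workload standing in for do_work) is to create and destroy each context inside the thread that uses it, rather than pushing and popping shared contexts from a pool; a minimal sketch:

import threading
import numpy as np
from pycuda import driver as cuda
from pycuda import gpuarray

cuda.init()

def worker(device_id, results):
    # Create, use, and release the context entirely within this thread.
    ctx = cuda.Device(device_id).make_context()
    try:
        a = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        b = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        results[device_id] = ((a + b) / 2).get()
    finally:
        ctx.pop()
        ctx.detach()  # drop the reference so the context can be deleted

results = {}
threads = [threading.Thread(target=worker, args=(i, results))
           for i in range(cuda.Device.count())]
for t in threads:
    t.start()
for t in threads:
    t.join()

Keeping each context's entire lifetime inside one thread avoids the cross-thread push/pop traffic that makes stale-context errors hard to localize.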