Hi
I have been trying to get started with the glinterop.py example on the
wiki (using pycuda 2011.2.2 built from git), but it produces
File "GlInterop.py", line 153, in process
dest_mapping.device_ptr())
File "/usr/local/lib/python2.6/dist-packages/pycuda-2011.2.2-
py2.6-linux-x86_64.egg/pycuda/driver.py",
line 429, in function_prepared_call
func.param_set_texref(texref)
ArgumentError: Python argument types in
Function.param_set_texref(Function, int)
did not match C++ signature:
param_set_texref(pycuda::function {lvalue}, pycuda::texture_reference)
along with several DeprecationWarnings. Has anyone recently had better
success with Py{GL, CUDA} integration?
I did try modifying the C++ param_set_texref to take an int and cast it
to a reference, but that led only to segfaults.
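For reference, my (possibly wrong) understanding is that the texrefs handed to
prepare() should be actual TextureReference objects from mod.get_texref(), not
integers. A standalone sketch of that pattern (made-up names, not the example's
actual code):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
texture<float, 2> src_tex;
__global__ void copy_from_tex(float *dest, int w)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    dest[y * w + x] = tex2D(src_tex, x, y);
}
""")

src_tex = mod.get_texref("src_tex")
a = np.random.randn(16, 16).astype(np.float32)
drv.matrix_to_texref(a, src_tex, order="C")

copy_from_tex = mod.get_function("copy_from_tex")
# prepare() takes TextureReference objects here, not device pointers/ints
copy_from_tex.prepare("Pi", texrefs=[src_tex])

dest = gpuarray.zeros((16, 16), np.float32)
copy_from_tex.prepared_call((1, 1), (16, 16, 1), dest.gpudata, np.int32(16))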
Thanks,
Marmaduke
Frédéric Bastien <nouiz(a)nouiz.org> writes:
> Hi,
>
> I made a PR that uses Context.attach() to fix the problem. If you want,
> it might be worth adding this to the documentation.
What's a PR? But in any case--any addition to the docs is most welcome!
Andreas
Hi,
I made a PR that uses Context.attach() to fix the problem. If you want,
it might be worth adding this to the documentation.
thanks
Fred
On Mon, Jun 11, 2012 at 2:45 PM, Frédéric Bastien <nouiz(a)nouiz.org> wrote:
> thanks
>
> I'll try to finish it this week.
>
> Fred
>
> On Thu, Jun 7, 2012 at 6:11 PM, Andreas Kloeckner
> <lists(a)informa.tiker.net> wrote:
>> Frédéric Bastien <nouiz(a)nouiz.org> writes:
>>
>>> Hi,
>>>
>>> I tested it, but I get this error:
>>>
>>> File "/u/bastienf/repos/Theano/theano/misc/tests/test_pycuda_utils.py",
>>> line 4, in <module>
>>> import theano.misc.pycuda_init
>>> File "/u/bastienf/repos/Theano/theano/misc/pycuda_init.py", line 34,
>>> in <module>
>>> import pycuda.autoinit
>>> File "/u/bastienf/repos/pycuda.git/build.old/lib.linux-x86_64-2.7/pycuda/autoinit.py",
>>> line 1, in <module>
>>> import pycuda.driver as cuda
>>> File "/u/bastienf/repos/pycuda.git/build.old/lib.linux-x86_64-2.7/pycuda/driver.py",
>>> line 545, in <module>
>>> _add_functionality()
>>> File "/u/bastienf/repos/pycuda.git/build.old/lib.linux-x86_64-2.7/pycuda/driver.py",
>>> line 525, in _add_functionality
>>> Function._param_set = function_param_set_pre_v4
>>> NameError: global name 'function_param_set_pre_v4' is not defined
>>>
>>>
>>> If I change the pycuda code to take the other branch of the if, I get this error:
>>>
>>>
>>> File "/u/bastienf/repos/Theano/theano/misc/tests/test_pycuda_utils.py",
>>> line 57, in test_to_cudandarray
>>> px = pycuda.gpuarray.zeros((3,4,5), 'float32')
>>> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/gpuarray.py",
>>> line 795, in zeros
>>> result.fill(zero)
>>> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/gpuarray.py",
>>> line 507, in fill
>>> value, self.gpudata, self.mem_size)
>>> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/driver.py",
>>> line 480, in function_prepared_async_call
>>> func._launch_kernel(grid, block, arg_buf, shared_size, stream)
>>> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/driver.py",
>>> line 486, in function___getattr__
>>> return self.get_attribute(getattr(function_attribute, name.upper()))
>>> AttributeError: type object 'function_attribute' has no attribute
>>> '_LAUNCH_KERNEL'
>>>
>>> Do you know how to fix those errors?
>>
>> Looks like your email fell through the cracks--sorry! This should be
>> fixed in git.
>>
>> Andreas
Hi,
How does one find out beforehand whether a given CUDA architecture is enough to
execute a project/program?
I am looking to do a topological sort on the GPU. I think this Tesla architecture
has 4 GPUs (128 cores each). How do I decide whether it is sufficient to process
a big graph on it?
If there are not enough threads, can we assign the computation of several chunks
to one thread (instead of one chunk per thread)? This sounds easy, but it probably
needs some work, or maybe it can't be done at all. (A minimal sketch of that
pattern is at the end of this message.)
Any ideas on how to find out beforehand whether a big graph can be processed on it?
I am also looking into this, and if I find something relevant I will post it.
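Here is a minimal sketch of the several-chunks-per-thread idea (a grid-stride
loop). The kernel name and the per-node work are placeholders, not real
topological-sort code:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void process_nodes(float *data, int n)
{
    /* Each thread strides over the whole array, so a fixed-size grid can
       cover a graph with far more nodes than there are threads. */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        data[i] += 1.0f;   /* placeholder for the real per-node work */
}
""")

process_nodes = mod.get_function("process_nodes")
n = 10000000
data = gpuarray.zeros(n, dtype=np.float32)
process_nodes(data.gpudata, np.int32(n), block=(256, 1, 1), grid=(1024, 1))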
Frédéric Bastien <nouiz(a)nouiz.org> writes:
> Hi,
>
> I tested it, but I get this error:
>
> File "/u/bastienf/repos/Theano/theano/misc/tests/test_pycuda_utils.py",
> line 4, in <module>
> import theano.misc.pycuda_init
> File "/u/bastienf/repos/Theano/theano/misc/pycuda_init.py", line 34,
> in <module>
> import pycuda.autoinit
> File "/u/bastienf/repos/pycuda.git/build.old/lib.linux-x86_64-2.7/pycuda/autoinit.py",
> line 1, in <module>
> import pycuda.driver as cuda
> File "/u/bastienf/repos/pycuda.git/build.old/lib.linux-x86_64-2.7/pycuda/driver.py",
> line 545, in <module>
> _add_functionality()
> File "/u/bastienf/repos/pycuda.git/build.old/lib.linux-x86_64-2.7/pycuda/driver.py",
> line 525, in _add_functionality
> Function._param_set = function_param_set_pre_v4
> NameError: global name 'function_param_set_pre_v4' is not defined
>
>
> If I change the pycuda code to take the other branch of the if, I get this error:
>
>
> File "/u/bastienf/repos/Theano/theano/misc/tests/test_pycuda_utils.py",
> line 57, in test_to_cudandarray
> px = pycuda.gpuarray.zeros((3,4,5), 'float32')
> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/gpuarray.py",
> line 795, in zeros
> result.fill(zero)
> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/gpuarray.py",
> line 507, in fill
> value, self.gpudata, self.mem_size)
> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/driver.py",
> line 480, in function_prepared_async_call
> func._launch_kernel(grid, block, arg_buf, shared_size, stream)
> File "/u/bastienf/repos/pycuda.git/build/lib.linux-x86_64-2.7/pycuda/driver.py",
> line 486, in function___getattr__
> return self.get_attribute(getattr(function_attribute, name.upper()))
> AttributeError: type object 'function_attribute' has no attribute
> '_LAUNCH_KERNEL'
>
> Do you know how to fix those errors?
Looks like your email fell through the cracks--sorry! This should be
fixed in git.
Andreas
Thomas Wiecki <Thomas_Wiecki(a)brown.edu> writes:
> As this seems to be the codepy/cgen thread I thought I'd tack this on here.
>
> I want to port thrust code that is a little bit more involved than the sort
> example. Namely the example code for summary statistics (
> http://code.google.com/p/thrust/source/browse/examples/summary_statistics.cu
> )
>
> I think I would be able to port all of this using the appropriate cgens
> (e.g. Struct, Template) with some tinkering. However, I wonder if it is
> really necessary to port everything. Is it possible to wrap only the parts
> I need access to and include the others as one big string?
>
> I suppose alternatively I could compute the summary stats quite easily with
> gpuarray as well. Is there likely to be a performance difference? It seems
> that that would be easier.
PyCUDA's gpuarray reduction can do the same thing, and will likely be
much easier to get going. Currently, it seems that PyCUDA's reduction is
about 3x slower on structs, but I haven't yet figured out why. I'd
appreciate help with that, though.
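For the simple moments (mean/variance), a rough, untested sketch with
ReductionKernel would be something like the following. Unlike the thrust
example it is not a single-pass struct reduction, just separate passes over
the data:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.reduction import ReductionKernel

# reduction that sums the squares of the input
sum_sq = ReductionKernel(np.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*x[i]",
        arguments="float *x")

x = gpuarray.to_gpu(np.random.randn(1000000).astype(np.float32))
n = x.size
mean = gpuarray.sum(x).get() / n
var = sum_sq(x).get() / n - mean ** 2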
Andreas
As this seems to be the codepy/cgen thread I thought I'd tack this on here.
I want to port thrust code that is a little bit more involved than the sort
example. Namely the example code for summary statistics (
http://code.google.com/p/thrust/source/browse/examples/summary_statistics.cu
)
I think I would be able to port all of this using the appropriate cgens
(e.g. Struct, Template) with some tinkering. However, I wonder if it is
really necessary to port everything. Is it possible to wrap only the parts
I need access to and include the others as one big string?
I suppose alternatively I could compute the summary stats quite easily with
gpuarray as well. Is there likely to be a performance difference? It seems
that that would be easier.
Thomas
On Thu, May 31, 2012 at 11:31 AM, Bryan Catanzaro <bcatanzaro(a)acm.org>wrote:
> Yup, it can make a difference. =)
>
> The trick you mention for conjugate gradient works because the only
> thing control flow has to know is whether to launch another iteration
> - but it doesn't need to know what to do during that iteration. The
> actual work to be performed in each iteration of CG is independent of
> the state of the solver. This isn't the case for many other important
> optimization problems, where the next optimization step depends on the
> value of the result of the current step.
>
> - bryan
>
> On Thu, May 31, 2012 at 8:18 AM, Andreas Kloeckner
> <lists(a)informa.tiker.net> wrote:
> > Bryan Catanzaro <bcatanzaro(a)acm.org> writes:
> >
> >> I agree that data size matters in these discussions. But I think the
> >> right way to account for it is show performance at a range of data
> >> sizes, as measured from Python.
> >>
> >> The assumption that you'll keep the GPU busy isn't necessarily true.
> >> thrust::reduce, for example (which max_element uses internally),
> >> launches a big kernel, followed by a small kernel to finish the
> >> reduction tree, followed by a cudaMemcpy to transfer the result back
> >> to the host. The GPU won't be busy during the small kernel, nor
> >> during the cudaMemcpy, nor during the conversion back to Python, etc.
> >> Reduce is often used to make control flow decisions in optimization
> >> loops, where you don't know what the next optimization step to be
> >> performed is until the result is known, and so you can't launch the
> >> work speculatively. If the control flow is performed in Python, all
> >> these overheads are exposed to application performance - so I think
> >> they matter.
> >
> > Glad you brought that up. :) The conjugate gradient solver in PyCUDA
> > addresses exactly that by simply running iterations as fast as it can
> > and shepherding the residual results to the host on their own time,
> > deferring convergence decisions until the data is available. That was
> > good for a 20% or so gain last time I measured it (on a GT200).
> >
> > Andreas
> >
>
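A hedged sketch of the deferred-convergence trick Andreas describes above;
do_iteration, compute_residual, max_iterations and tolerance are placeholders
for the real solver pieces:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

stream = drv.Stream()
lag = 5           # let convergence checks trail this many iterations behind
pending = []      # (iteration, page-locked buffer, event)
converged = False

for it in range(max_iterations):
    do_iteration(stream)                   # placeholder: the CG update kernels
    resid_gpu = compute_residual(stream)   # placeholder: 1-element GPUArray

    # queue an async copy of the residual; don't wait for it
    buf = drv.pagelocked_empty(1, np.float32)   # a real solver would reuse these
    drv.memcpy_dtoh_async(buf, resid_gpu.gpudata, stream)
    evt = drv.Event()
    evt.record(stream)
    pending.append((it, buf, evt))

    # only inspect residuals that are at least `lag` iterations old, so the
    # host never stalls the GPU waiting for the newest one
    while pending and it - pending[0][0] >= lag:
        old_it, old_buf, old_evt = pending.pop(0)
        old_evt.synchronize()              # almost certainly done by now
        if old_buf[0] < tolerance:
            converged = True
    if converged:
        break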
Hi everyone,
This has been causing me problems for a few weeks now, and I'm hoping
someone would be able to shed some light on it. I need to run some
CPU-intensive tasks in the background while launching GPU kernels in the
main loop of a project I'm working on, so I've been trying to offload to
a multiprocessing process. But it seems that whenever I try to launch a
kernel while the background process is active, the kernel fails to produce
the correct results (no errors are thrown). Once the kernel
returns wrong results once, it continues to fail for the remainder of
the run, even if the background process has already finished and joined.
I've put together a small code sample to demonstrate this [attached].
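The attachment isn't reproduced here, but its structure is roughly the
following (heavily simplified; the real kernel and background task do more
work):

import multiprocessing
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void double_it(float *a)
{ a[threadIdx.x] *= 2.0f; }
""")
double_it = mod.get_function("double_it")

def cpu_task():
    # stand-in for the CPU-intensive background work
    s = 0.0
    for i in range(10 ** 7):
        s += i
    return s

if __name__ == "__main__":
    p = multiprocessing.Process(target=cpu_task)
    p.start()

    # launch a kernel while the background process is running
    a = np.ones(32, dtype=np.float32)
    double_it(drv.InOut(a), block=(32, 1, 1), grid=(1, 1))
    print("kernel ok: %s" % np.allclose(a, 2.0))   # this is where I see wrong results while p is alive

    p.join()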
Is this known behaviour, and if so, is there any workaround I can use?
Or am I doing something completely wrong?
Thanks,
Brendan Wood
When I recently tried running the sample program on the main codepy
documentation page, I noticed that I had to manually add the python
library (python2.7 in my case) to the list of libraries guessed by
codepy.jit.guess_toolchain in order to get the program to run. Any
ideas as to why the toolchain guesser isn't finding the python library
by default on my system? The python library on my system is located at
/usr/lib64/libpython2.7.so.
I'm using codepy 2012.1.2, cgen 2012.1, python 2.7, gcc 4.6.1, and
boost 1.46.1 on 64-bit Linux.
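Concretely, my workaround just patches the guessed toolchain by hand before
compiling, roughly like this (module and attribute names as they appear on my
install; they may differ between codepy versions):

# guess_toolchain imported from codepy.jit as referenced above; in some
# versions it may live in codepy.toolchain instead
from codepy.jit import guess_toolchain

toolchain = guess_toolchain()
if "python2.7" not in toolchain.libraries:
    toolchain.libraries.append("python2.7")
if "/usr/lib64" not in toolchain.library_dirs:
    toolchain.library_dirs.append("/usr/lib64")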
L.G.
David Eklund <deklund(a)gmail.com> writes:
> We have a persistent problem attempting to multithread using pycuda. I have
> a thread pool with one thread per GPU, each one initializes its own context
> with its given device ID and waits to read jobs from a common Queue object.
> The main thread processes requests and adds CUDA related jobs to the Queue.
> This works well enough and utilizes all available GPUs but we frequently
> run into a locking issue when issuing lots of relatively fast cuda calls
> where one computation will hang indefinitely. When the contexts are created
> with the pycuda.driver.ctx_flags.SCHED_BLOCKING_SYNC flag and I attach to a
> hung process I find it's waiting on a semaphore in cuCtxSynchronize in
> libcuda.so; when the contexts are created without the SCHED_BLOCKING_SYNC
> flag I find its still stuck in cuCtxSynchronize but in a spin loop waiting
> for results.
>
> I have an alternative version with all the same code but bypassing pycuda
> and calling directly into an nvcc compiled shared library using ctypes that
> uses cudaSetDevice and cudaDeviceSynchronize rather than the cuCtx*
> functions and it does not experience these same locking issues.
This looks much like an Nvidia bug--I really don't know what PyCUDA
could be doing to prompt this sort of behavior. Do you get the same
behavior if you use multiple processes? Anyway, it might be worth
pinging Nvidia over this. It'd also be helpful if you could post a
minimal program that reproduces this. Also, what driver version?
> Has anyone run into this kind of issue before? Also, is there support in
> pycuda (or planned support for future releases) to use cudaDevice*
> functions rather than explicit context management?
cuda* functions are from the so-called 'run-time API', whereas PyCUDA
uses the cu* functions, which form the so-called 'driver API'.
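For reference, the one-context-per-GPU-thread setup you describe maps onto the
driver API roughly like this (untested sketch; the actual work is replaced by a
trivial gpuarray sum):

import threading, Queue          # Python 2-era stdlib names
import numpy as np
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

drv.init()

def worker(dev_id, jobs, results):
    # one context per GPU, owned by this thread for its whole lifetime
    ctx = drv.Device(dev_id).make_context(drv.ctx_flags.SCHED_BLOCKING_SYNC)
    try:
        while True:
            job = jobs.get()
            if job is None:      # sentinel: shut this worker down
                break
            # placeholder work: sum an array on this thread's GPU
            results.put(float(gpuarray.sum(gpuarray.to_gpu(job)).get()))
    finally:
        ctx.pop()

jobs, results = Queue.Queue(), Queue.Queue()
threads = [threading.Thread(target=worker, args=(i, jobs, results))
           for i in range(drv.Device.count())]
for t in threads:
    t.start()
for _ in range(8):
    jobs.put(np.random.randn(1024).astype(np.float32))
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()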
Andreas