On Monday 09 March 2009, you wrote:
> Thanks for the reply.
>
> I tried taking out the descr.flags call (I have PyCUDA 0.92, and this
> is in my test_driver.py), to no avail: I still get the LogicError for
> cuArrayCreate.
>
> (h,w) are (1,5) in my case.
>
> Here's a minimal code snippet:
>
> import pycuda.driver as cuda
> import pycuda.autoinit
> import numpy
>
> I1 = complex(0,1)
> row = numpy.asarray(numpy.random.randn(1,5) +
> I1*numpy.random.randn(1,5), dtype=numpy.complex64, order="C")
> h, w = row.shape
> descr = cuda.ArrayDescriptor()
> descr.w = w
> descr.h = h
> descr.format = cuda.array_format.FLOAT
> descr.num_channels = 2
> ary = cuda.Array(descr) ### <------------------------------- Crashes here
The descriptor attributes should be 'width' and 'height', not 'w' and 'h'.
HTH,
Andreas
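
Applied to the snippet above, the fix might look like the following sketch. It assumes PyCUDA and a CUDA device are available for the second function; `complex_to_float2_view` and `make_float2_array` are hypothetical helper names, not part of PyCUDA. The first helper only illustrates why two float channels match complex64 data.

```python
import numpy


def complex_to_float2_view(row):
    """View an (h, w) complex64 array as (h, w, 2) float32.

    Each complex64 element is stored as two adjacent float32 values
    (real, imaginary), which is the layout a 2-channel FLOAT array expects.
    The input is copied only if it is not already contiguous complex64.
    """
    row = numpy.ascontiguousarray(row, dtype=numpy.complex64)
    return row.view(numpy.float32).reshape(row.shape + (2,))


def make_float2_array(h, w):
    """Allocate a 2-channel float CUDA array (needs PyCUDA and a GPU)."""
    # Imported lazily so the numpy-only helper above works without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a context on import)
    import pycuda.driver as cuda

    descr = cuda.ArrayDescriptor()
    descr.width = w    # not descr.w
    descr.height = h   # not descr.h
    descr.format = cuda.array_format.FLOAT
    descr.num_channels = 2  # float2: real and imaginary parts
    return cuda.Array(descr)


row = (numpy.random.randn(1, 5)
       + 1j * numpy.random.randn(1, 5)).astype(numpy.complex64)
pairs = complex_to_float2_view(row)
```

On a machine with a GPU, `make_float2_array(*row.shape)` would then allocate the array that the original snippet failed to create.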

Hey everybody! I am new to CUDA and pycuda, but I am working hard to
understand.
My question is this:
Is there a way for me to use multiple python threads in order to run cuda
code on multiple GPUs?
I have created several threads, and in each I attempt to create a context
for a different cuda device, but I am getting an "invalid context" error
when I try to copy an array over.
Any suggestions?
Thanks in advance!
>>>Chris

Dear list members,
Just to be clear up front, this is an offer of collaboration in
research. This is not a job offer. There is no pay offered. I expect
that the research would result in one or more academic papers
published in journals. That would be the only personal benefit of the
collaboration.
I'm working on some statistical estimation methods based on matching
data to simulated nonparametric moments that are fitted using kernel
regression. This can be useful when moments are not calculable
analytically. The kernel regression part is computationally demanding,
especially when the data is high dimensional. Doing kernel smoothing
basically requires calculating the matrix of distances between the N
points x (each point has K coordinates) and P points y (also of
dimension K). So the problem is to fill out the NxP matrix D, where
D_ij is the distance between x_i and y_j. This is pretty obviously
easy to parallelize, and I have done this using MPI. I'm interested in
trying this with CUDA through pycuda. However, I'm not very handy with
pycuda, and not even all that handy with Python, and doing this myself
would be pretty slow. So, if anyone with Python and pycuda skills is
interested in collaborating on some research that I'm confident would
lead to one or more published academic papers, I'd be interested in
discussing it with you.
If interested, please contact me directly at michael.creel AT uab.es.
Please let's not clutter up this forum. Thanks to Andreas for
permission to post this message.
Michael Creel
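
For reference, the N x P distance matrix described above can be sketched on the CPU as follows. This is a plain-Python illustration that assumes Euclidean distance (the message does not name a metric); `distance_matrix` is a hypothetical name. Every entry D_ij depends only on x_i and y_j, which is why the computation is easy to parallelize.

```python
import math


def distance_matrix(xs, ys):
    """Return the N x P matrix D with D[i][j] = distance(xs[i], ys[j]).

    xs holds N points and ys holds P points; every point has K coordinates.
    Euclidean distance is an assumption made for this sketch.
    """
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))) for y in ys]
        for x in xs
    ]


xs = [(0.0, 0.0), (3.0, 4.0)]               # N = 2 points, K = 2
ys = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]   # P = 3 points, K = 2
D = distance_matrix(xs, ys)
```

A reference like this is mainly useful for checking a GPU version on small inputs.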

cc'ing mailing list
On Thu, Mar 5, 2009 at 17:40, Nicholas Tung <ntung(a)ntung.com> wrote:
> Hi Andreas,
>
> Is there any device emulation support at the moment? What's your
> priority for this? I often get kernel crashes that bring down my entire
> machine, which makes debugging pretty slow.
>
> thanks,
> Nicholas
>

Hi! Is there any documentation for using the __constant__ construct with
pycuda? Specifically, I want to be able to declare a constant array in the
'structure', and then fill in the constant structure from the host
interface, somewhat like initializing a texture reference. I looked
in the pycuda help and did not find anything.
Thank you for the help and for pycuda in general.
-Wish
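
One way this is commonly done is to look up the constant symbol with `Module.get_global` and copy host data to the returned device pointer. The sketch below assumes a PyCUDA version in which `SourceModule` lives in `pycuda.compiler`, plus a working CUDA device; the kernel, the symbol name `coeffs`, and the helpers `fill_constant` and `run_on_gpu` are made up for illustration.

```python
KERNEL_SOURCE = """
__constant__ float coeffs[4];

__global__ void scale(float *out)
{
    out[threadIdx.x] = coeffs[threadIdx.x] * 2.0f;
}
"""


def fill_constant(mod, name, host_array):
    """Copy host_array into the __constant__ symbol `name` of module `mod`."""
    # get_global returns the symbol's device address and its size in bytes.
    ptr, size = mod.get_global(name)
    if host_array.nbytes != size:
        raise ValueError("host array is %d bytes, symbol is %d bytes"
                         % (host_array.nbytes, size))
    import pycuda.driver as cuda  # lazy import: needs PyCUDA and a GPU
    cuda.memcpy_htod(ptr, host_array)


def run_on_gpu():
    """End-to-end use; call this only on a machine with a CUDA device."""
    import numpy
    import pycuda.autoinit  # noqa: F401  (creates a context on import)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule(KERNEL_SOURCE)
    fill_constant(mod, "coeffs", numpy.arange(4, dtype=numpy.float32))
    out = numpy.empty(4, dtype=numpy.float32)
    mod.get_function("scale")(cuda.Out(out), block=(4, 1, 1))
    return out
```

The size check is there because a mismatched copy into constant memory is easy to make and hard to spot afterwards.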

Hi, I'm having *no* trouble copying numpy matrices of type complex64
to the GPU global memory and working on them as float2's, but when
attempting to use such numpy matrices to initialize *textures*, I seem
to be running into some problems.
First I try:
<code>
import pycuda.driver as cuda
#... import other things
row = numpy.random.randn(1,5).astype(numpy.complex64)
cuda.matrix_to_texref(row, texref, order="F")
</code>
which fails with: TypeError: cannot convert dtype 'complex64' to array format
I encounter the same error when attempting to use:
<code>
texref.set_array(cuda.matrix_to_array(row,"F"))
</code>
presumably for the same reason.
I then attempted to recreate the functionality in
cuda.matrix_to_array() by building an ArrayDescriptor and using
Memcpy2D, but making use of the ArrayDescriptor.num_channels field,
which I believe allows the creation of float2's when its format is
FLOAT. Specifically,
<code>
h, w = row.shape
descr = cuda.ArrayDescriptor()
descr.w = w
descr.h = h
descr.format = cuda.array_format.FLOAT
descr.num_channels = 2
descr.flags = 0
ary = cuda.Array(descr)
</code>
and this last line gives me the cryptic error:
"pycuda._driver.LogicError: cuArrayCreate failed: invalid value."
Any advice on setting up complex-valued textures using CUDA float2s
and numpy arrays would be most appreciated. Thanks,
Ahmed

Hi Nicholas,
On Saturday 28 February 2009, Nicholas Tung wrote:
> In much of the tutorial, numpy arrays are treated as buffer objects.
In the implementation, too. In particular, memcpy_htod doesn't really care
what it's given, as long as that something adheres to the Python buffer
interface. Numpy arrays do so most of the time.
> This doesn't always work, and should be pointed out somewhere...
What's the failure? If it's something non-intuitive, we should catch it in
PyCuda and give a nicer warning.
> I found
> out the hard way. I don't know if compressed matrices have similar
> effects, but this code seems to fail for me.
>
> new = numpy.concatenate([original, numpy.zeros((original.shape[0], 0),
> uint32)], axis=1)
> gpu = drv.to_device(new.astype(uint32))
>
> The ndarray.copy function seems to resolve the problem. Sorry I'm not
> familiar with numpy internals.
Please post the output of
print(original.shape)
print(original.strides)
print(original.flags)
Andreas
PS: Please use the mailing list for all questions, for archival mainly.
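
As a general illustration of what those three attributes reveal (it is not a diagnosis of the case above, whose cause the thread leaves open), the sketch below builds a strided view that is not C-contiguous and then makes a contiguous copy. The array names are made up.

```python
import numpy

base = numpy.arange(12, dtype=numpy.uint32).reshape(3, 4)
view = base[:, ::2]  # every other column: shares memory with `base`

print(view.shape)    # (3, 2)
print(view.strides)  # (16, 8): rows 16 bytes apart, elements 8 bytes apart
print(view.flags)    # C_CONTIGUOUS is False for this view

# A copy has its own densely packed memory, so a raw byte-wise transfer
# of its buffer sees exactly the elements of the array and nothing else.
packed = view.copy()
```

If the flags show a non-contiguous array, `copy()` (or `numpy.ascontiguousarray`) before the transfer is the usual remedy.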

Thank you,
On Mon, Mar 9, 2009 at 6:44 AM, Nicholas S-A <novanasa(a)gmail.com> wrote:
> Hi,
>
> * Chris Heuser <drummerdude3791(a)charter.net> [2009-03-09 00:30:58 -0400]:
>
>> Is there a way for me to use multiple python threads in order to run cuda
>> code on multiple GPUs?
>>
>> I have created several threads, and in each I attempt to create a context
>> for a different cuda device, but I am getting an "invalid context" error
>> when I try to copy an array over.
>> Any suggestions?
>>
>
> I use pp (http://www.parallelpython.com/) to run different python
> instances. It is very easy and even allows execution on different
> machines (though I have not tried this with CUDA code). Essentially it
> spawns a new python instance, so each CUDA call runs in a different
> process instead of a different thread and the Global Interpreter Lock is
> avoided.
>
> There could be ways that involve less overhead, but this works fine for
> me.
>
> Hope that helps!
> Nicholas
>
I may end up doing just that. But before I change my implementation, I want
to make sure I am not simply making a noob error. A little more
description:
Before the program splits into threads, I run cuda.Device.count() and store
the result in a variable, here called *cudaCnt*. Later in
execution the program splits into *cudaCnt* threads. I am using a
thread class inheriting from threading.Thread, and in the overridden
__init__ function, I create a context on a specific device. I am afraid that
the error might be in creating this context:
-------->self.dev = cuda.Device(self.ID)
-------->self.cntxt = self.dev.make_context()
Here self.ID identifies one of the *cudaCnt* devices: the
first thread has self.ID = 0, the second has self.ID = 1, and so on, up to
*cudaCnt* - 1. Then I begin executing cuda code in the class's *run()*
method.
1. Am I leaving something out in my creation of the individual contexts?
2. Am I correct in thinking that this will create contexts on separate cuda
devices?
Thanks again!
>>>Chris
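
One thing worth checking in a setup like this: a CUDA context is made current for the thread that creates it, and `__init__` of a `threading.Thread` subclass runs in the thread that constructs the object, not in the new thread. The stdlib-only sketch below demonstrates that difference. The class name `GpuWorker` is hypothetical, and the PyCUDA calls appear only as comments marking where they would go under that assumption.

```python
import threading


class GpuWorker(threading.Thread):
    def __init__(self, device_id):
        threading.Thread.__init__(self)
        self.device_id = device_id
        # Runs in the *parent* thread. A context created here would be
        # current for the parent, not for the worker thread.
        self.init_thread = threading.get_ident()
        self.run_thread = None

    def run(self):
        # Runs in the *new* thread, so this is where the context would go
        # (assuming cuda.init() was called once beforehand):
        #   dev = cuda.Device(self.device_id)
        #   ctx = dev.make_context()
        #   try:
        #       ...  # memcpy and kernel launches
        #   finally:
        #       ctx.pop()
        self.run_thread = threading.get_ident()


workers = [GpuWorker(i) for i in range(2)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

main_thread = threading.get_ident()
```

After the run, every worker's `init_thread` equals the main thread's identifier while its `run_thread` does not, which is consistent with an "invalid context" error when the context is created in `__init__` and used in `run()`.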

Fix applied, thanks!
Andreas
On Sunday 08 March 2009, Guy Billings wrote:
> Dear Andreas,
>
> I have recently started to play with CUDA. I don't know why but it
> captured my imagination somehow.
>
> I downloaded PyCUDA. It is really excellent. Thanks for doing this, it
> makes using CUDA much easier.
>
> I set it up on my computer (mac book pro, Mac OS X 10.5.6, Python
> 2.5). When I ran the demo_elementwise.py example I got the error:
>
> guy-billingss-macbook-pro:pycuda_examples guybillings$ python demo_elementwise.py
> Traceback (most recent call last):
>   File "demo_elementwise.py", line 6, in <module>
>     a_gpu = curand((50,))
>   File "/Library/Python/2.5/site-packages/pycuda-0.92-py2.5-macosx-10.5-i386.egg/pycuda/curandom.py", line 215, in rand
>     result.gpudata, numpy.random.randint(2**32), result.size)
>   File "mtrand.pyx", line 700, in mtrand.RandomState.randint
> OverflowError: long int too large to convert to int
>
> I also got this same error for any script calling numpy.random.randint.
> I did some digging around on the internet and found this post:
>
> http://www.mail-archive.com/numpy-discussion@lists.sourceforge.net/msg03396.html
>
> Which explains the problem. I changed the file curandom.py in the base
> distribution so that line 215 goes from
>
> result.gpudata, numpy.random.randint(2**32), result.size)
>
> to
>
> result.gpudata, numpy.random.randint(2**31-1), result.size)
>
> and now it seems ok.
>
> Thought you might find this useful.
>
> Once again, thanks for a great bit of code.
>
> Yours sincerely
>
> Guy Billings, UK
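
The bound in that fix can also be written against the integer type's limit instead of a literal. A small sketch, assuming the seed only needs to fit in a signed 32-bit integer (the variable names are made up):

```python
import numpy

INT32_MAX = numpy.iinfo(numpy.int32).max  # 2**31 - 1

# randint(high) draws from [0, high), so the result always fits in a
# signed 32-bit integer, even where a C long is only 32 bits wide.
seed = numpy.random.randint(INT32_MAX)
```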