This looks like a case where Python's binary-operator dispatch has tripped you up. You'll find that this does work:
>>> c = a + b
What's happening in your "b+a" case is that the Python interpreter is first executing:
b.__add__(a)
In this case, b.__add__() is the implementation provided by numpy, which then discovers that the right operand (a) is iterable. It then tries to iterate over the elements of a and return a new ndarray doing the element-wise sum. This iteration is what triggers the PyCUDA exception about trying to index "a" improperly.
Testing this, I found that this worked, much to my surprise:
>>> np.float64(1) + (1,2,3,4)
array([ 2., 3., 4., 5.])
In the "float(b) + a" case, casting b to a regular Python float gets you a different implementation of __add__(). The float version of __add__() returns NotImplemented because "a" is not a numeric type it recognizes. When that happens, the interpreter falls back to:
a.__radd__(b)
At which point the gpuarray version of the function takes over and does the right thing. By flipping the order of the operands, you can force Python to use the __add__ method you want, and ponder why it is that some language purists dislike operator overloading... :)
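The fallback behaviour described above can be reproduced with a toy class standing in for gpuarray (no GPU needed); this is only a sketch of Python's dispatch rule, not PyCUDA code:

```python
# Toy stand-in for a gpuarray: defines both __add__ and __radd__.
class FakeGPUArray:
    def __add__(self, other):
        return "gpu __add__"
    def __radd__(self, other):
        return "gpu __radd__"

a = FakeGPUArray()

# The left operand's __add__ wins when it can handle the right operand:
assert a + 1.0 == "gpu __add__"

# float.__add__(a) returns NotImplemented (a is not a number), so the
# interpreter falls back to the right operand's __radd__:
assert 1.0 + a == "gpu __radd__"

# np.float64.__add__, by contrast, does NOT give up: it treats an iterable
# right operand element-wise, which is what triggers the PyCUDA indexing error.
```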
On Jul 27, 2012, at 2:33 PM, Anthony LaTorre <tlatorre9(a)gmail.com> wrote:
> I don't quite understand this behavior, and was hoping someone else might have some insight.
>
> >>> import numpy as np
> >>> import pycuda.autoinit
> >>> from pycuda import gpuarray as ga
> >>> a = ga.to_gpu(np.ones(10))
> >>> b = np.float64(1.0)
> >>> c = b + a
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/proj/chroma_env/local/lib/python2.7/site-packages/pycuda/gpuarray.py", line 690, in __getitem__
> raise ValueError("non-slice indexing not supported: %s" % (idx,))
> ValueError: non-slice indexing not supported: 0
> >>> c = float(b) + a
>
> It seems like adding a numpy.float64 number to a gpuarray is causing something to try to access individual items from the gpuarray, while adding a regular python float doesn't.
>
> I'm using PyCUDA version (2012,1) and the graphics card is a GeForce GTX 550 Ti on Ubuntu 12.04 linux.
> _______________________________________________
> PyCUDA mailing list
> PyCUDA(a)tiker.net
> http://lists.tiker.net/listinfo/pycuda


Amit Kumar <in4tunio(a)gmail.com> writes:
> Hi,
>
> Gpuocelot allows cuda programs to run over CPU. I am new to PyCuda as well
> as to Gpuocelot, except that I successfully installed both on my machine
> running Ubuntu. The hello program for Gpuocelot works. But as the machine
> does not have Nvidia GPU and GPU device drivers installed needed for
> PyCuda, pycuda._driver fails saying "no device".
>
> * Does anyone use PyCuda on top of Gpuocelot? I could not find anything in
> my search. If this is not already included in PyCuda, how difficult would
> it be to include?
GPUOcelot claims to implement the Nv driver API. That's exactly what
PyCUDA needs. So, in principle, this should work swimmingly. OTOH, if
you'd like to execute GPU-like code on a CPU, then either the Intel or
the AMD OpenCL implementation (with PyOpenCL) will likely provide better
performance.
Andreas


Oh thanks, so the main problem is that by not passing the array dimensions into the kernel I read outside the array. Now it is clear.
As for numpy, only yesterday I discovered that for a 3d array the indexing is (z,y,x); and this made me mad :) But I thought that for a bidimensional array it was (x,y) as in C.
Thanks for the correction! In fact, among the many trials, I also tried to set (1,1,1) and (50,50) for dim_x=dim_y=50. But this did not work, probably because of the inversion of i and j in the cpu test code.
Thanks!
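A tiny numpy-only illustration of the (z, y, x) ordering discussed above, and how it lines up with the flat index used in the kernel (the dimension names are just for illustration):

```python
import numpy

dim_x, dim_y, dim_z = 4, 3, 2
# numpy puts the slowest-varying axis first, so a 3d array is shaped (z, y, x):
a = numpy.zeros((dim_z, dim_y, dim_x), dtype=numpy.int32)
# the element at x=3, y=2, z=1 is therefore addressed as a[z, y, x]:
a[1, 2, 3] = 7
# and its C-order flat index matches the kernel's x + dim_x*y + dim_x*dim_y*z:
flat = a.ravel()
assert flat[3 + dim_x * 2 + dim_x * dim_y * 1] == 7
```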
> Date: Thu, 19 Jul 2012 16:59:33 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> Hi Andrea,
>
> On Thu, Jul 19, 2012 at 4:37 PM, Andrea Cesari <andrea_cesari(a)hotmail.it> wrote:
> > yes..for example if i do:
> > dim_x=33
> > dim_y=33
> > then change grid and block to this: (32,32,1) and (2,1)
> > because i do (33*33=1089 threads, so grid = 1089/1024 = 1.063 --> 2)
>
> When you do this, you read values outside of your array inside the
> kernel — since the kernel is not aware of the actual size of your array,
> it tries all the x,y pairs from (0,0) to (32*2,32). My advice about
> using blockDim and gridDim to get the array size applied only to cases
> where the array size equals the grid size. In the general case, you
> have to pass the array dimensions into the kernel. Please see the fixed
> script below. I also parametrized the block size (my video card does
> not have CC 2.0) and fixed some errors with the order of dimensions —
> be careful: in numpy it is (z, y, x), and in a CUDA block/grid it is
> (x, y, z).
>
>
> import pycuda.driver as cuda
> import pycuda.autoinit
> from pycuda.compiler import SourceModule
> import numpy
> import time
> from pycuda.gpuarray import to_gpu
>
> dim_x=50
> dim_y=100
> dim_z=10
> a = numpy.random.randn(dim_z,dim_y,dim_x)
> a = a.astype(numpy.int32)
> b=numpy.zeros((dim_y,dim_x),dtype=numpy.int32)
> a_gpu=to_gpu(a)
> b_gpu=to_gpu(b)
>
> mod = SourceModule("""
> __global__ void findmin(int *a, int *b, int dim_x, int dim_y, int dim_z)
> {
>     int idx = threadIdx.x + blockIdx.x * blockDim.x; //OK
>     int idy = threadIdx.y + blockIdx.y * blockDim.y; //OK
>     if (idx >= dim_x || idy >= dim_y)
>         return;
>     int flat_id1 = idx + dim_x * idy;
>     int min = 100;
>     for (int idz = 0; idz < dim_z; idz++)
>     {
>         int flat_id = idx + dim_x * idy + (dim_x * dim_y) * idz; //OK
>         if (a[flat_id] < min)
>         {
>             min = a[flat_id];
>             b[flat_id1] = min;
>         }
>     }
> }
> """)
>
> block_size = 16
> func = mod.get_function("findmin")
> func(a_gpu, b_gpu, numpy.int32(dim_x), numpy.int32(dim_y), numpy.int32(dim_z),
>      block=(block_size,block_size,1),
>      grid=((dim_x - 1) / block_size + 1, (dim_y - 1) / block_size + 1))
>
> print a_gpu.get()
> print "b :\n"
> b = b_gpu.get()
> print b
> minimo = 100
> b1 = numpy.zeros((dim_y,dim_x),dtype=numpy.int32)
> for i in range(0,dim_x):
>     for j in range(0,dim_y):
>         minimo = min(a[:,j,i])
>         b1[j,i] = minimo
>
> print "Difference between CPU:\n"
> print b1-b

yes.. for example if i do:
dim_x=33
dim_y=33
then change grid and block to this: (32,32,1) and (2,1)
because i do (33*33=1089 threads, so grid = 1089/1024 = 1.063 --> 2)
> Date: Thu, 19 Jul 2012 16:34:22 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> Hi Andrea,
>
> On Thu, Jul 19, 2012 at 4:26 PM, Andrea Cesari <andrea_cesari(a)hotmail.it> wrote:
> > The problem is that the results match the cpu only for dim_x and dim_y
> > smaller than 32.
> > For higher dimensions the cpu and gpu results are different.
>
> When you change dim_x and dim_y values, do you also change grid and
> block size in call to findmin? Judging by the code, they are hardcoded
> to always be (32,32,1) and (1,1).
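The grid computation quoted above (dividing the total thread count, 1089/1024, and rounding up) is the source of the confusion: each grid dimension should be a per-axis ceiling division of the array dimension by the block dimension, as the fixed script in the reply does. A minimal sketch:

```python
# Per-axis grid sizing: ceiling division of each array dimension by the
# block dimension along that axis (not total threads / threads-per-block).
def grid_for(dim, block):
    return (dim - 1) // block + 1

# a 33x33 array with 32x32 blocks needs a 2x2 grid:
grid = (grid_for(33, 32), grid_for(33, 32))
# a 500x500 array with 32x32 blocks needs a 16x16 grid:
grid_500 = (grid_for(500, 32), grid_for(500, 32))
```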

Hi, this is my code: it takes a 3d array and, for each pixel of the matrix, finds the minimum and puts it in the corresponding pixel of a matrix b. Then it compares the result with the cpu.
Obviously, with grid=(1,1) the max dimensions of the matrix can be 32x32 (1024 threads per block; GTX 580).
So, because I want a matrix of (500,500), I thought to declare block=(32,32,1) and grid=(16,16). But it doesn't work. I noticed that it works only with square matrices and with dimensions smaller than 32.
I have this problem only with the 3d array; in fact, with a bidimensional array I solved it doing:
int idx = threadIdx.x + blockIdx.x * blockDim.x;
with block=(1024,1,1) and grid=(245,1) (for a (500,500) matrix).
I tried to do the same thing but it doesn't work...
the code:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import time
from pycuda.gpuarray import to_gpu

dim_x=32
dim_y=32
dim_z=10
a = numpy.random.randn(dim_z,dim_y,dim_x)
a = a.astype(numpy.int32)
b=numpy.zeros((dim_x,dim_y),dtype=numpy.int32)
dimz=numpy.array([dim_z],dtype=numpy.int32)
a_gpu=to_gpu(a)
b_gpu=to_gpu(b)
dimz_gpu=to_gpu(dimz)

mod = SourceModule("""
__global__ void findmin(int *a, int *b, int *dimz_gpu)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x; //OK
    int idy = threadIdx.y + blockIdx.y * blockDim.y; //OK
    int x_width = blockDim.x * gridDim.x; //OK
    int y_width = blockDim.y * gridDim.y;
    int flat_id1 = idx + x_width * idy;
    int min = 4294967296;
    for (int idz = 0; idz < dimz_gpu[0]; idz++)
    {
        int flat_id = idx + x_width * idy + (x_width * y_width) * idz; //OK
        if (a[flat_id] < min)
        {
            min = a[flat_id];
            b[flat_id1] = min;
        }
    }
}
""")

func = mod.get_function("findmin")
func(a_gpu, b_gpu, dimz_gpu, block=(32,32,1), grid=(1,1))
print a_gpu.get()
print "b :\n"
b = b_gpu.get()
print b
minimo = 100
b1 = numpy.zeros((dim_x,dim_y),dtype=numpy.int32)
for i in range(0,dim_x):
    for j in range(0,dim_y):
        minimo = min(a[:,i,j])
        b1[i,j] = minimo
print "Difference between CPU:\n"
print b1-b
Thanks!
Andrea
> Date: Thu, 12 Jul 2012 10:18:20 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
>
> Hi Andrea,
>
> Unfortunately, I am not quite familiar with the topic. Probably the
> issue here is incorrect padding, or an incorrect mode of the numpy
> function you are using for comparison — logically I'd expect the mode
> to be 'wrap', not 'reflect'. Also, why did you prefer to take
> correlate1d() as a reference instead of numpy.convolve() or
> scipy.fftpack.convolve()?
>
> Perhaps, it will be helpful to look at the sources of the two
> functions above and see how they do the padding. Scipy one is at
> https://github.com/scipy/scipy/blob/master/scipy/fftpack/src/convolve.c
>
>
>
> On Thu, Jul 12, 2012 at 2:38 AM, Andrea Cesari <andrea_cesari(a)hotmail.it> wrote:
> > Hi,
> > excuse me if i write you in private, but it isn't strictly a pycuda problem.
> > I would like to ask your opinion.
> > To do the equivalent of scipy.ndimage.convolve1d (that is, a linear
> > convolution), I'm trying to do something like this:
> > a and b are the vectors to convolve.
> >
> > A=FFT(a)
> > B=FFT(b) with zero-padding at the end
> > CONV=A.*B
> > conv=INV_FFT(CONV)
> >
> > Is it right? Because I tried to do this in matlab (I have not yet installed
> > pyfft) but the results are different.
> > Is it a mathematical problem?
> >
> > Thanks for your patience,
> > Andrea
> >
> >> Date: Wed, 11 Jul 2012 22:48:25 +1000
> >
> >> Subject: Re: [PyCUDA] Thread Problem
> >> From: mantihor(a)gmail.com
> >> To: andrea_cesari(a)hotmail.it
> >> CC: pycuda(a)tiker.net
> >>
> >> Hi Andrea,
> >>
> >> On Wed, Jul 11, 2012 at 10:25 PM, Andrea Cesari
> >> <andrea_cesari(a)hotmail.it> wrote:
> >> > __global__ void gpu_kernel(int *corrGpu,int *aMod,int *b,int
> >> > *kernelSize_h)
> >> > {
> >> > int j,step1=kernelSize_h[0]/2; // <---
> >> ...
> >> > """)
> >>
> >> When I remove /2 where the arrow points, I get results identical with
> >> the CPU version. Are you sure it is necessary there?
> >>
> >> > About your advice: when i do int idx = threadIdx.x+step, doesn't idx
> >> > start from step1? so when j=0, idx-step1+j = 0? is it wrong?
> >>
> >> Yes, sorry, that was my mistake. Everything is correct in this part.
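For what it's worth, the likely mathematical culprit in the FFT recipe quoted above is the padding: for a linear convolution via FFT, both inputs must be zero-padded to len(a)+len(b)-1, otherwise the pointwise product computes a circular convolution. A numpy-only sketch (not PyCUDA/pyfft code):

```python
import numpy

# Linear convolution via FFT: pad BOTH inputs to the full output length,
# len(a) + len(b) - 1; fft(x, n) zero-pads x to length n automatically.
a = numpy.array([1.0, 2.0, 3.0])
b = numpy.array([0.5, 0.5])

n = len(a) + len(b) - 1
A = numpy.fft.fft(a, n)
B = numpy.fft.fft(b, n)
conv = numpy.fft.ifft(A * B).real

# matches the direct linear convolution:
assert numpy.allclose(conv, numpy.convolve(a, b))
```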

bruno villasenor <br1villasen(a)gmail.com> writes:
> Hi,
>
> I'm trying to run the custom reduction example, the code is:
>
> import pycuda.gpuarray as gpuarray
> import pycuda.driver as cuda
> import pycuda.autoinit
> import numpy
>
> a = gpuarray.arange(400, dtype=numpy.float32)
> b = gpuarray.arange(400, dtype=numpy.float32)
>
> from pycuda.reduction import ReductionKernel
> krnl = ReductionKernel(numpy.float32, neutral="0",
> reduce_expr="a+b", map_expr="x[i]*y[i]",
> arguments="float *x, float *y")
>
> my_dot_prod = krnl(a, b).get()
>
> I get the following error:
Use
krnl = ReductionKernel(numpy.dtype(numpy.float32), neutral="0", ...
for now. Current PyCUDA git also accepts your usage.
Andreas
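The TypeError in Bruno's traceback ('int' * 'getset_descriptor') can be reproduced without a GPU; it comes from taking .itemsize on the type numpy.float32 rather than on a dtype instance:

```python
import numpy

# numpy.float32 is a *type*; .itemsize on the class is a descriptor, not 4.
# numpy.dtype(numpy.float32) is a dtype *instance*, whose .itemsize is a
# plain int, which is what ReductionKernel's shared-size computation needs.
assert isinstance(numpy.float32, type)
assert numpy.dtype(numpy.float32).itemsize == 4

try:
    _ = 4 * numpy.float32.itemsize   # descriptor: same TypeError as the traceback
    reproduced = False
except TypeError:
    reproduced = True
assert reproduced
```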

Hi,
I'm trying to run the custom reduction example, the code is:
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
a = gpuarray.arange(400, dtype=numpy.float32)
b = gpuarray.arange(400, dtype=numpy.float32)
from pycuda.reduction import ReductionKernel
krnl = ReductionKernel(numpy.float32, neutral="0",
reduce_expr="a+b", map_expr="x[i]*y[i]",
arguments="float *x, float *y")
my_dot_prod = krnl(a, b).get()
I get the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/bruno/Desktop/Dropbox/Developer/pyCUDA/reduction.py in <module>()
     12                        arguments="float *x, float *y")
     13
---> 14 my_dot_prod = krnl(a, b).get()
     15
     16
/usr/local/lib/python2.7/dist-packages/pycuda-2012.1-py2.7-linux-x86_64.egg/pycuda/reduction.pyc in __call__(self, *args, **kwargs)
    272         result = empty((block_count,), self.dtype_out, repr_vec.allocator)
    273
--> 274         kwargs = dict(shared_size=self.block_size*self.dtype_out.itemsize)
    275
    276         #print block_count, seq_count, self.block_size, sz
TypeError: unsupported operand type(s) for *: 'int' and 'getset_descriptor'
WARNING: Failure executing file: <reduction.py>
Is this a numpy version incompatibility?
Any ideas?
Thanks a lot.
Bruno

Hi Frédéric,
On Thu, Jul 19, 2012 at 8:58 AM, Frédéric Bastien <nouiz(a)nouiz.org> wrote:
> How useful is it to abstract between PyCUDA
> and PyOpenCL? Personally, I probably won't use that part, but I want
> to abstract between CUDA and OpenCL.
It was either that or writing almost identical libraries for PyCuda
and PyOpenCL. The user is by no means forced to use the cluda context; it
is just a wrapper around the PyCuda/PyOpenCL context that is passed to
computation classes. After that, the corresponding array classes and
synchronization functions from the target APIs can be used. I
personally use this abstraction because I get better GPU performance
with CUDA, but I also need my programs to run on the CPU with reasonable
speed (mostly for debugging), and for that OpenCL is preferable.
> I like the idea of making code generator that do transformation on the
> input before doing other computation. This is something I wanted
> Theano code generator to do, but I never got time to implement it.
> What do the current parameters derive_s_from_lp and derive_lp_from_s mean?
These are two functions that are used to derive types. "s_from_lp"
means "stores from loads and parameters" (transformations can also
have scalar parameters, currently it can be seen in
test_matrixmul.py:test_preprocessing).
In the example the transformation is connected to "input pipeline",
i.e. to the input parameter "A". So it takes some values from new
external input parameters "A_re" and "A_im" (load=2) and combines them
into "A" (store=1). The derivation functions work in the following
situations:
1) When the object needs to derive types for the external signature
from its basis parameters, derive_lp_from_s() is called and supplied
with the type for "A", and expected to return types for "A_re" and
"A_im" (also for parameters, but there are none, so the empty list is
returned).
2) When the object needs to derive basis parameters from the arrays
supplied to prepare_for(), derive_s_from_lp() is called which performs
derivation in the other direction, producing data type for "A" from
data types of "A_re" and "A_im".
There are, in fact, four types of derivation functions. If we had a
transformation connected to the "output pipeline", we would need
derive_l_from_sp() and derive_sp_from_l().
Now that I think about it, "load" and "store" should probably be
called "input" and "output", it is less ambiguous :)
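To make the two directions concrete, here is a rough, hypothetical sketch of the derivation pair for the A_re/A_im -> A transformation described above (the names, signatures, and dtype table are illustrative only, not the library's actual API):

```python
import numpy

# Illustrative dtype table for pairing real and complex types.
_complex_for = {numpy.dtype(numpy.float32): numpy.dtype(numpy.complex64),
                numpy.dtype(numpy.float64): numpy.dtype(numpy.complex128)}
_real_for = {v: k for k, v in _complex_for.items()}

def derive_lp_from_s(s_dtype):
    # direction 1: given the store type (for "A"), produce the load types
    # ("A_re", "A_im") plus an empty list, since this transformation has
    # no scalar parameters
    real = _real_for[numpy.dtype(s_dtype)]
    return [real, real], []

def derive_s_from_lp(l_dtypes, p_dtypes):
    # direction 2: the store type for "A" from the load (and parameter) types
    return _complex_for[numpy.dtype(l_dtypes[0])]
```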
> Also the code section is not something I would call readable... Is it only
> because I never used Mako? Andreas, I think you used Mako, do you find
> this readable?
These are only one-liners, so, yes, they seem fine to me. Perhaps,
they would be more readable if I did not try to save on spacing, and
used some temporary variables? Also, do you find kernels (dummy.mako
and matrixmul.mako) hard to read too?
> I'm not sure that forcing people to use Mako is a good idea. Can we do
> without it?
The amount of Mako in the transformations can be reduced if I use
macros, for example, "LOAD1" instead of "${load.l1}". The latter was
chosen because I planned to add some error checking, which would be
less comprehensible with macros. It is much harder to avoid Mako for
non-trivial tasks though, like getting corresponding complex data type
for the real one, or providing support for multiplication of all
combinations of complex/real numbers. Is it really such an issue? It
is just Python code in curly braces, after all.
> I still think that we need to provide the user not just with a common
> gpu nd array object. We need to also provide fonctions on it. But I'm
> not sure how we should do this.
My project does not intersect with the "general ndarray" idea at all. If
there were some general array class instead of cl.Array/GPUArray,
that would only make things easier for me. I just think that the numpy
way of providing functions for arrays is not that good in the GPU case,
as the overhead there is more significant and should be made explicit and
manageable for the user.