Hi,
I read your documentation. The project is more than just a collection
of implementations. How useful is it to abstract between PyCUDA
and PyOpenCL? Personally, I probably won't use that part, but I do want
to abstract between CUDA and OpenCL.
I like the idea of making a code generator that performs transformations
on the input before doing other computation. This is something I wanted
the Theano code generator to do, but I never found the time to implement it.
What do the current parameters derive_s_from_lp and derive_lp_from_s mean?
Also, the code section is not something I would call readable... Is that
only because I have never used Mako? Andreas, I think you have used Mako;
do you find this readable?
I'm not sure that forcing people to use Mako is a good idea. Can we do
without it?
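As an aside, here is a minimal sketch of what templated kernel generation can look like with only the standard library (plain str.format instead of Mako). The kernel and function names are hypothetical illustrations, not tigger's actual API:

```python
# Hypothetical sketch: generating a typed CUDA kernel source from a plain
# Python template string, as a Mako-free alternative. Double braces {{ }}
# escape the C braces so str.format leaves them alone.
KERNEL_TEMPLATE = """
__global__ void scale({ctype} *a, {ctype} coeff)
{{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < {size})
        a[idx] *= coeff;
}}
"""

def render_kernel(ctype, size):
    """Substitute the element type and array size into the template."""
    return KERNEL_TEMPLATE.format(ctype=ctype, size=size)

source = render_kernel("float", 1024)
print(source)
```

This covers simple parameter substitution; Mako's advantage is loops and conditionals inside the template, which plain format strings cannot express.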
I still think that we need to provide the user with more than just a common
GPU ndarray object. We also need to provide functions that operate on it.
But I'm not sure how we should do this.
Andreas, do you have an idea?
Fred
On Wed, Jul 18, 2012 at 10:29 AM, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hi all,
>
> Some of you may remember compyte discussions last year when I made the
> suggestion of creating a library with a compilation of GPGPU
> algorithms, working both with PyOpenCL and PyCuda. Long story short, I
> have finally found some time and created a prototype. The preliminary
> tutorial can be found at http://tigger.publicfields.net/tutorial.html
> and the project itself at https://github.com/Manticore/tigger . The
> examples are working and those few tests I have are running. The code
> in tigger.core is a mess, but I'm working on it.
>
> At this stage this library is a prototype (or even a proof of concept)
> whose fate is not sealed. My current plans are to refactor tigger.core
> and tigger.cluda (sorry for stealing the name, Andreas, I can change
> it :) over the course of a week or two and start filling it with
> actual algorithms. One of the first will be FFT, thus deprecating
> pyfft; a list of other plans is in TODO.rst. On the other hand, the
> library could be made a part of compyte, although I'm not sure it'll
> fit its goals.
>
> Anyway, any sort of input is appreciated. Those who want to use the
> library for practical applications may want to wait for the next
> version, which is supposed to be somewhat stable.
>
> _______________________________________________
> PyCUDA mailing list
> PyCUDA(a)tiker.net
> http://lists.tiker.net/listinfo/pycuda


Hi,
I'm trying to pass a 3D array to a kernel. The kernel should take, in parallel, each vector of the stack and multiply it by 2.
But I get this error:
expression must have pointer-to-object type
I know that, obviously, C and Python types are different. In my opinion I should declare a triple pointer in the kernel, but reading some PyCUDA examples, I gather that on the C side each numpy array is seen as a single pointer.
This is my code:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
import time
from pycuda.gpuarray import to_gpu
a = numpy.random.randn(10,10,10)
a = a.astype(numpy.float32)
a_gpu=to_gpu(a)
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int k;
    int idx = threadIdx.x + threadIdx.y*blockDim.y;
    for(k = 0; k < 10; k++)
        a[idx][k] *= 2;  /* double subscript on a float* -- the line the compiler rejects */
}
""")
func = mod.get_function("doublify")
func(a_gpu, block=(10,10,1),grid=(1,1))
print a
print "Matrix multiplied by 2:\n"
print a_gpu.get()
Thanks
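The compile error comes from indexing a plain float* with two subscripts; device code only sees a flat buffer, so the 3D layout has to be recovered with manual index arithmetic. A CPU-side sketch of the equivalent flattened indexing (pure numpy, no GPU needed; the sizes mirror the script above):

```python
import numpy as np

# A C-contiguous (10, 10, 10) array is stored as one flat buffer of 1000
# floats; element [i][j][k] lives at offset (i*10 + j)*10 + k.
a = np.random.randn(10, 10, 10).astype(np.float32)
flat = a.ravel()  # a view of the same buffer, as the kernel would see it

i, j, k = 3, 7, 5
assert flat[(i * 10 + j) * 10 + k] == a[i, j, k]

# So inside the kernel one writes a[idx*10 + k] (for a flattened row index
# idx), never a[idx][k], when 'a' is declared as float*.
```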

In fact you're right... but according to the theory it could be kernelSize_h[0]/2. Indeed, if you look at my CPU code, I use kernelSize_h[0]/2.
import numpy as np
import scipy.ndimage as nd
import time
a=[1,0,1,1,0,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0,0,1,1,0,1,0,2,2,2,1]
#a = np.array([0,0,1,1,1,0,0,0,1,1,1,0,0,1,1])
lungA=len(a)
#b=[-5,-5,-5,-5,-5,-4,-4,5,5,5,5,5,4,4]
b=np.array([-1,-1,-1,1,1,-1,1,1,-1,1])
step=len(b)/2
corrCpu=np.zeros(lungA)
corrCpu=corrCpu.astype(np.int16)
a1=a[:(step)][::-1]
a2=a[-(step-1):][::-1]
a=np.append(a1,np.append(a,a2))
t1=time.time()
for i in range(step, lungA + step):
    for j in range(0, len(b)):
        corrCpu[i-step] = corrCpu[i-step] + (a[i-step+j]*b[j])
print time.time()-t1
a=[1,0,1,1,0,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0,0,1,1,0,1,0,2,2,2,1]
#a = np.array([0,0,1,1,1,0,0,0,1,1,1,0,0,1,1])
t2=time.time()
corrPy=nd.correlate1d(a,b,mode='reflect',origin=0)
print time.time()-t2
print "CorCpu= "
print corrCpu
print "CorPy= "
print corrPy
print "Difference:\n"
print corrCpu-corrPy
This is strange...
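As a sanity check (a pure-numpy sketch, independent of scipy): the manual loop above is equivalent to a 'valid' sliding dot product over the reflect-padded array, and the inner loop does run over the full kernel length len(b); len(b)/2 appears only in the padding offset:

```python
import numpy as np

a = np.array([1,0,1,1,0,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0,0,1,1,0,1,0,2,2,2,1])
b = np.array([-1,-1,-1,1,1,-1,1,1,-1,1])
lungA = len(a)
step = len(b) // 2  # integer division, as in the Python 2 original

# Reflect padding: step elements in front, step-1 at the back.
aMod = np.concatenate([a[:step][::-1], a, a[-(step - 1):][::-1]])

# The manual double loop from the script above (on the padded array).
corr = np.zeros(lungA, dtype=np.int16)
for i in range(step, lungA + step):
    for j in range(len(b)):        # full kernel length, not len(b)//2
        corr[i - step] += aMod[i - step + j] * b[j]

# The same thing as one sliding dot product per output element.
windows = np.array([aMod[i:i + len(b)] @ b for i in range(lungA)])
assert np.array_equal(corr, windows)
```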
> Date: Wed, 11 Jul 2012 22:48:25 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> Hi Andrea,
>
> On Wed, Jul 11, 2012 at 10:25 PM, Andrea Cesari
> <andrea_cesari(a)hotmail.it> wrote:
> > __global__ void gpu_kernel(int *corrGpu,int *aMod,int *b,int *kernelSize_h)
> > {
> > int j,step1=kernelSize_h[0]/2; // <---
> ...
> > """)
>
> When I remove /2 where the arrow points, I get results identical with
> the CPU version. Are you sure it is necessary there?
>
> > About your advise: when i do: int idx = threadIdx.x+step, idx doesn't start
> > from step1? so when j=0 idx-step1+j =0 ? it's wrong?
>
> Yes, sorry, that was my mistake. Everything is correct in this part.

Hi,
this is the full code:
import scipy
import scipy.ndimage as nd
import numpy as np
import numpy.ma as ma
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.driver import Context
import pycuda.tools
import pycuda.driver as cuda
import numpy
import os, sys, glob
from pycuda.compiler import SourceModule
import time
from pycuda.gpuarray import to_gpu
a = np.array([0,0,1,1,1,0,0,0,1,1,1,0,0,1,1],dtype=numpy.int32)
kernel_h= np.array([-1,-1,-1,1,1,1],dtype=numpy.int32)
step=len(kernel_h)/2
a1=a[:(step)][::-1]
a2=a[-(step-1):][::-1]
aMod=np.append(a1,np.append(a,a2))
numthreads=len(a)
c=np.zeros(len(a),dtype=numpy.int32)
kernelSize_h=np.zeros(2)
kernelSize_h=kernelSize_h.astype(numpy.int32)
kernelSize_h[0]=len(kernel_h)
kernelSize_h[1]=len(a)
mod=SourceModule("""
__global__ void gpu_kernel(int *corrGpu,int *aMod,int *b,int *kernelSize_h)
{
    int j, step1 = kernelSize_h[0]/2;
    int idx = threadIdx.x + step1;
    for(j = 0; j < step1; j++)
        corrGpu[idx - step1] += aMod[idx + j - step1]*b[j];
}
""")
manipulate_vector=mod.get_function("gpu_kernel")
c_gpu=to_gpu(c)
manipulate_vector(c_gpu,drv.In(aMod),drv.In(kernel_h),drv.In(kernelSize_h),block=(numthreads,1,1),grid=(1,1))
print "Corr. GPU \n"
print c_gpu.get()
corrcpu=nd.correlate1d(a,kernel_h,mode='reflect')
print "Corr CPU= "
print corrcpu
print "Difference: "
print c_gpu.get()-corrcpu
About your advice: when I do int idx = threadIdx.x + step, doesn't idx start from step1? So when j=0, idx - step1 + j = 0? Is that wrong?
> Date: Wed, 11 Jul 2012 10:03:33 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> Hi Andrea,
>
> Please send the full working script which anyone can save and execute
> without assembling it from the excerpts you provided. In the mean
> time, that's what I can say by looking at the kernel:
>
> On Wed, Jul 11, 2012 at 1:24 AM, Andrea Cesari <andrea_cesari(a)hotmail.it> wrote:
> > __global__ void gpu_kernel(int *corrGpu,int *aMod,int *b,int *kernelSize_h)
> >
> > {
> > int j,step1=kernelSize_h[0]/2;
> > int idx = threadIdx.x+step1;
> > for(j=0;j<step1;j++)
> > corrGpu[idx-step1]+=aMod[idx+j-(step1)]*b[j];
> >
> > }
>
> With the construction like "aMod[idx+j-(step1)]", reads sometimes
> occur outside of the aMod array (consider idx=0 and j=0, for example —
> you will be reading from aMod[-step1]).

OK, I fixed the problem! In fact, I now always use mem_alloc and memcpy.
Another question, which in my opinion is another error in my use of the thread index.
I had to implement the equivalent of scipy.ndimage.convolve1d (that is, a cross-correlation function... but this isn't the problem).
I first wrote a sequential script to verify my algorithm. The significant code is this:
step = len(b)/2
for i in range(step, lungA + step):
    for j in range(0, len(b)):
        corrCpu[i-step] = corrCpu[i-step] + (a[i-step+j]*b[j])
where 'a' and 'b' are the vectors to be correlated.
The scipy function is:
nd.correlate1d(a, b, mode='reflect'),
where 'reflect' means that the first len(b)/2 elements of a and the last len(b)/2-1 elements of a are repeated (after reflection) before and after 'a'.
To do this, in my script I create an aMod vector like this:
a1=a[:(step)][::-1]
a2=a[-(step-1):][::-1]
a=np.append(a1,np.append(a,a2))
So for example if a = [-1,1,0,1,1,0,1,1] and b = [1,1,-1,-1] (so len(b) = 4), then aMod = [1,-1,-1,1,0,1,1,0,1,1,1].
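The padding construction can be checked directly on that example (a pure-numpy sketch):

```python
import numpy as np

a = np.array([-1, 1, 0, 1, 1, 0, 1, 1])
b = np.array([1, 1, -1, -1])
step = len(b) // 2  # = 2

a1 = a[:step][::-1]         # first step elements, reflected -> [1, -1]
a2 = a[-(step - 1):][::-1]  # last step-1 elements, reflected -> [1]
aMod = np.append(a1, np.append(a, a2))

print(aMod.tolist())  # [1, -1, -1, 1, 0, 1, 1, 0, 1, 1, 1]
```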
I converted everything to PyCUDA like this:
numthreads=len(a)
c=np.zeros(len(a),dtype=numpy.int32)
kernelSize_h=np.zeros(2)
kernelSize_h=kernelSize_h.astype(numpy.int32)
kernelSize_h[0]=len(kernel_h)
kernelSize_h[1]=len(a)
mod=SourceModule("""
__global__ void gpu_kernel(int *corrGpu,int *aMod,int *b,int *kernelSize_h)
{
    int j, step1 = kernelSize_h[0]/2;
    int idx = threadIdx.x + step1;
    for(j = 0; j < step1; j++)
        corrGpu[idx - step1] += aMod[idx + j - step1]*b[j];
}
""")
manipulate_vector=mod.get_function("gpu_kernel")
c_gpu=to_gpu(c)
manipulate_vector(c_gpu,drv.In(aMod),drv.In(kernel_h),drv.In(kernelSize_h),block=(numthreads,1,1),grid=(1,1))
print "Corr. GPU \n"
print c_gpu.get()
corrcpu=nd.correlate1d(a,kernel_h,mode='reflect')
print "Corr CPU= "
print corrcpu
print "Difference: "
print c_gpu.get()-corrcpu
But the result (in the CUDA version) is different, while the sequential scripts match. I can't see the error! Can you help me?
I've been stuck on this for days!
Thanks!
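For reference, a pure-numpy emulation of the kernel's per-thread index math (a sketch, no GPU required) shows where the two versions diverge: the CPU loop runs j over the full kernel length len(b), while the kernel's loop only runs j up to step1 = len(b)/2, so each output element accumulates only half of the products:

```python
import numpy as np

a = np.array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1])
b = np.array([-1, -1, -1, 1, 1, 1])
lungA, step = len(a), len(b) // 2

# Reflect padding, exactly as in the script.
aMod = np.concatenate([a[:step][::-1], a, a[-(step - 1):][::-1]])

def emulate(loop_bound):
    """One 'thread' per output element, mirroring the kernel's index math."""
    out = np.zeros(lungA, dtype=np.int64)
    for tid in range(lungA):           # threadIdx.x
        idx = tid + step               # int idx = threadIdx.x + step1;
        for j in range(loop_bound):
            out[idx - step] += aMod[idx + j - step] * b[j]
    return out

full = emulate(len(b))  # loop bound len(b): matches the sequential script
half = emulate(step)    # loop bound step1 = len(b)/2: what the kernel does
assert not np.array_equal(full, half)
```

This matches Bogdan's later observation that removing the /2 makes the GPU result agree with the CPU one.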
> Date: Wed, 11 Jul 2012 00:21:57 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> On Wed, Jul 11, 2012 at 12:15 AM, Andrea Cesari
> <andrea_cesari(a)hotmail.it> wrote:
> > so, the firs two elements of a vector are always garbage?
> > can i solve it allocating manually the memory? but should be the same of
> > drv.Out() i think..or no?
>
> The first two elements are garbage because:
> 1) you have not initialized them to anything (consequence of using drv.Out), and
> 2) you have not written anything there (consequence of using i =
> threadIdx.x + 2)
> So if you want them to contain something meaningful, fix either 1) or 2).
>
> 1) can be fixed, for example, like this:
>
> # to_gpu() takes a numpy array, copies it to the GPU and returns a
> # reference to this GPU array
> from pycuda.gpuarray import to_gpu
>
> lung_vett=10;
> thread_index = mod.get_function("thread_index")
> dest=numpy.zeros(lung_vett);
> dest_gpu = to_gpu(dest)
> thread_index(dest_gpu, block=(lung_vett,1,1))
>
> print dest_gpu.get()


So, the first two elements of the vector are always garbage?
Can I solve it by allocating the memory manually? But that should be the same as drv.Out(), I think... or not?
> Date: Wed, 11 Jul 2012 00:08:39 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> Hi Andrea,
>
> On Tue, Jul 10, 2012 at 11:55 PM, Andrea Cesari
> <andrea_cesari(a)hotmail.it> wrote:
> > But if i modify the kernel in this mode:
> >
> > const int i = threadIdx.x+2
> > dest[i]=i;
> >
> > the result is: [1 0 2 3 4 5 6 7 8 9]
> > while, in my opinion,should be [0,0,2,3,4,5,6,7,8,9] (confirmed by C code).
> > why?
>
> drv.Out() allocates an empty array which you later fill in your
> kernel, except for the first two elements. So these first two elements
> contain garbage, something that was in this part of video memory
> before. In your case it is 1 and 0, but could be anything. The fact
> that you create the CPU array with numpy.zeros() does not mean
> anything, since all the values in the CPU array are overwritten by the
> data from GPU.
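The effect Bogdan describes can be reproduced on the CPU with plain numpy (a sketch: np.empty stands in for the uninitialized GPU buffer that drv.Out() allocates):

```python
import numpy as np

lung_vett = 10

# drv.Out() analogue: the destination buffer starts out uninitialized
# (np.empty may contain arbitrary leftover values, like fresh video memory).
dest = np.empty(lung_vett)

# The kernel writes dest[i] = i only for i = threadIdx.x + 2, so it never
# touches dest[0] and dest[1]. (The real launch uses lung_vett threads, so
# i would run up to 11 and write past the end of dest -- the other bug
# pointed out in the thread; here we stop at the array bound.)
for thread_idx in range(lung_vett - 2):
    i = thread_idx + 2
    dest[i] = i

# dest[2:] is now [2..9]; dest[0] and dest[1] hold whatever was in memory.
print(dest[2:].tolist())  # [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```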

> On Tue, Jul 10, 2012 at 11:55 PM, Andrea Cesari
> <andrea_cesari(a)hotmail.it> wrote:
>> But if i modify the kernel in this mode:
>>
>> const int i = threadIdx.x+2
>> dest[i]=i;
Moreover, here you are writing out of the 'dest' array and potentially
into some other allocated buffer.

But if I modify the kernel like this:
const int i = threadIdx.x + 2;
dest[i] = i;
the result is: [1 0 2 3 4 5 6 7 8 9]
while, in my opinion, it should be [0,0,2,3,4,5,6,7,8,9] (confirmed by C code). Why?
> Date: Tue, 10 Jul 2012 23:27:27 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> On Tue, Jul 10, 2012 at 11:22 PM, Andrea Cesari
> <andrea_cesari(a)hotmail.it> wrote:
> > If I understood correctly, dt.dtype_to_ctype(type) tells me the corresponding
> > variable type in Python?
>
> Not Python per se, but numpy types (the ones you get from numpy
> arrays' dtype field). dtype_to_ctype() takes a numpy datatype object and
> returns the C equivalent. NAME_TO_DTYPE provides the correspondence in
> the other direction, i.e. from C types to numpy datatypes.
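To illustrate the idea, here is a hand-rolled miniature of such a correspondence (a hypothetical sketch, not PyCUDA's actual tables, which live in pycuda.tools; the int64 mapping in particular is platform-dependent):

```python
import numpy as np

# Hypothetical miniature dtype -> C type table.
DTYPE_TO_CTYPE = {
    np.dtype(np.int32): "int",
    np.dtype(np.int64): "long long",   # assumption: LP64 vs LLP64 varies
    np.dtype(np.float32): "float",
    np.dtype(np.float64): "double",
}
# The reverse direction, analogous in spirit to NAME_TO_DTYPE:
NAME_TO_DTYPE = {ctype: dt for dt, ctype in DTYPE_TO_CTYPE.items()}

a = np.zeros(4, dtype=np.float32)
print(DTYPE_TO_CTYPE[a.dtype])  # float
print(NAME_TO_DTYPE["int"])     # int32
```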