create Event from cl_event
by Gregor Thalhammer

Dear Andreas,
I am currently working on a Cython-based wrapper for the OpenCL FFT library from AMD: https://github.com/geggo/gpyfft
For this I need to create a pyopencl Event instance from a cl_event returned by the library. I have attached a patch against a recent pyopencl that adds this capability, similar to the from_cl_mem_as_int() method of the MemoryObject class. Could you please add this to pyopencl?
Thanks for your help
Gregor
5 years, 8 months

Re: [PyOpenCL] Atomic operations in PyOpenCL
by Alexander

Hello,
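I think a few changes should fix the problem. The first atomic_add can
be written as:
    atomic_add(&l_total, t);
and the atom_add as:
    atom_add(&result[0], l_total);
The thing is that atomic_add needs its first parameter (where to add) as a
pointer, which is easily achieved with &. Note also that the leading uint
and the type names inside the parentheses have to go: with them, the line
is a function declaration rather than a call, which is what the
'overloadable' error is complaining about. Hope that your problem has been
solved.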
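
For reference, here is a minimal sketch of how the whole kernel might look with both calls corrected, embedded in Python the way PyOpenCL expects it. It makes a few assumptions that are not in the original post: Y is supplied as a build-time macro, the helper name build_program is made up for illustration, and atomic_add (the OpenCL 1.1 core name) is used for both calls, since the build log shows atomic_add defined as atom_add on this platform anyway.

```python
# Corrected kernel source: both atomic operations are now real calls that
# pass the ADDRESS of the accumulator as the first argument.
# Assumption: Y is defined at build time (e.g. -DY=64); not from the post.
KERNEL_SRC = """
__kernel void k(__global float4 *spins,
                __global uint *result)
{
    ushort lid = 2 * get_local_id(0);
    ushort gid = 2 * get_group_id(1);

    float total = 0;
    __local uint l_total;

    if (lid == 0)
        l_total = 0;

    barrier(CLK_LOCAL_MEM_FENCE);

    total += spins[lid * Y + gid].w;
    total += spins[lid * Y + gid + 1].w;
    total += spins[(lid + 1) * Y + gid].w;
    total += spins[(lid + 1) * Y + gid + 1].w;

    uint t = (uint) total;

    /* A call, not a declaration: pointer to the local accumulator. */
    atomic_add(&l_total, t);

    barrier(CLK_LOCAL_MEM_FENCE);

    /* One work-item per group adds the group's sum to the global result. */
    if (lid == 0)
        atomic_add(&result[0], l_total);
}
"""


def build_program(ctx, y):
    """Compile KERNEL_SRC for an existing pyopencl.Context, defining Y=y.

    Hypothetical helper; pyopencl is imported lazily because it needs an
    installed OpenCL runtime.
    """
    import pyopencl as cl
    return cl.Program(ctx, KERNEL_SRC).build(options=["-DY=%d" % y])
```

The host still has to set result[0] to zero before launching the kernel, because every work-group only adds to it.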
Regards,
Alex.
On Mon, Sep 17, 2012 at 6:44 PM, J Diviney <justdivs(a)gmail.com> wrote:
> Hi,
>
> I'm writing a relatively simple function to sum up a set of results from a
> simulation I'm running, and want to do it on the GPU, since it involves
> summing a large dataset. The function is as follows:
>
>
> __kernel void k(__global float4 *spins,
>                 __global uint *result)
> {
>     ushort lid = 2 * get_local_id(0);
>     ushort gid = 2 * get_group_id(1);
>
>     float total = 0;
>     local uint l_total;
>
>     if(lid == 0)
>     {l_total = 0;}
>
>     barrier(CLK_LOCAL_MEM_FENCE);
>
>     total += spins[lid * Y + gid].w;
>     total += spins[lid * Y + gid + 1].w;
>     total += spins[(lid + 1) * Y + gid].w;
>     total += spins[(lid + 1) * Y + gid + 1].w;
>
>     uint t = (uint) total;
>
>     uint atomic_add(uint l_total, uint t);
>
>     barrier(CLK_LOCAL_MEM_FENCE);
>
>     if(lid == 0)
>     {uint atom_add(uint result[0], uint l_total);}
> }
> """
>>
>
> Note that this isn't my original code, I've been modifying it loads to try
> and get it to work, to no avail. (Y is an uint that I declare elsewhere).
> I've also tried adding __OVERLOADABLE__, __local and __global to the
> additions, but that didn't help either.
>
> The error message I get when compiling is:
>
>
> pyopencl.RuntimeError: clBuildProgram failed: build program failure -
>
> Build on <pyopencl.Device 'GeForce GTX 560 Ti' on 'NVIDIA CUDA' at 0x29f4d80>:
>
> :36:10: error: overloaded function 'atom_add' must have the 'overloadable' attribute
> uint atomic_add(uint l_total, uint t);
> ^
> <built-in>:3942:20: note: instantiated from:
> #define atomic_add atom_add
> ^
> <built-in>:3845:24: note: previous overload of function is here
> ulong __OVERLOADABLE__ atom_add(__local volatile ulong *ptr, ulong val);
> ^
> :41:11: error: overloaded function 'atom_add' must have the 'overloadable' attribute
> {uint atom_add(uint result[0], uint l_total);}
> ^
> :36:10: note: previous overload of function is here
> uint atomic_add(uint l_total, uint t);
> ^
> <built-in>:3942:20: note: instantiated from:
> #define atomic_add atom_add
> ^
>
> Any help would be greatly appreciated.
>
> Thanks,
> Justin
6 years, 6 months

Atomic operations in PyOpenCL
by J Diviney

Hi,
I'm writing a relatively simple function to sum up a set of results from a
simulation I'm running, and want to do it on the GPU, since it involves
summing a large dataset. The function is as follows:

    __kernel void k(__global float4 *spins,
                    __global uint *result)
    {
        ushort lid = 2 * get_local_id(0);
        ushort gid = 2 * get_group_id(1);

        float total = 0;
        local uint l_total;

        if(lid == 0)
        {l_total = 0;}

        barrier(CLK_LOCAL_MEM_FENCE);

        total += spins[lid * Y + gid].w;
        total += spins[lid * Y + gid + 1].w;
        total += spins[(lid + 1) * Y + gid].w;
        total += spins[(lid + 1) * Y + gid + 1].w;

        uint t = (uint) total;

        uint atomic_add(uint l_total, uint t);

        barrier(CLK_LOCAL_MEM_FENCE);

        if(lid == 0)
        {uint atom_add(uint result[0], uint l_total);}
    }
    """

Note that this isn't my original code; I've modified it a lot while trying
to get it to work, to no avail. (Y is a uint that I declare elsewhere.)
I've also tried adding __OVERLOADABLE__, __local and __global to the
additions, but that didn't help either.
The error message I get when compiling is:

    pyopencl.RuntimeError: clBuildProgram failed: build program failure -

    Build on <pyopencl.Device 'GeForce GTX 560 Ti' on 'NVIDIA CUDA' at 0x29f4d80>:

    :36:10: error: overloaded function 'atom_add' must have the 'overloadable' attribute
    uint atomic_add(uint l_total, uint t);
    ^
    <built-in>:3942:20: note: instantiated from:
    #define atomic_add atom_add
    ^
    <built-in>:3845:24: note: previous overload of function is here
    ulong __OVERLOADABLE__ atom_add(__local volatile ulong *ptr, ulong val);
    ^
    :41:11: error: overloaded function 'atom_add' must have the 'overloadable' attribute
    {uint atom_add(uint result[0], uint l_total);}
    ^
    :36:10: note: previous overload of function is here
    uint atomic_add(uint l_total, uint t);
    ^
    <built-in>:3942:20: note: instantiated from:
    #define atomic_add atom_add
    ^

Any help would be greatly appreciated.
Thanks,
Justin
6 years, 6 months

More good news
by Andreas Kloeckner

Hi all,
now that we have a flexible scan, a lot of stuff becomes quite easy:
http://documen.tician.de/pyopencl/array.html#sorting
:)
Performance isn't a dream yet, but I've also done exactly zero
tuning. It manages 34 MKeys/s on Fermi and 42 MKeys/s on Tahiti. For
comparison, numpy does about 10 MKeys/s on a CPU with a decent memory
system. The CL code on the CPU achieves about 10 MKeys/s on 4+ cores,
with the AMD implementation being 50% faster than Intel. (All this is on
32-bit integers.) If you've got some time to help tune this... :P
But the real good news here is that a) this was pretty easy to put
together on top of the existing scan primitive, and b) it actually
yields code that works on quite a bunch of CL implementations.
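
For anyone who wants to try it, a minimal sketch of calling the sort from Python might look like the following. It assumes the RadixSort class from pyopencl.algorithm and an installed OpenCL runtime; the function name sort_int32_on_device is made up for illustration, and the input is assumed to hold non-negative 32-bit integers.

```python
def sort_int32_on_device(host_values):
    """Sort non-negative 32-bit integers with PyOpenCL's scan-based radix sort.

    Hypothetical helper for illustration. Imports are done lazily because
    they require numpy, pyopencl and a working OpenCL implementation.
    """
    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.algorithm import RadixSort

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # Copy the keys to the device as 32-bit integers.
    keys = cl_array.to_device(queue, np.asarray(host_values, dtype=np.int32))

    # "ary[i]" is the expression used as the sort key for element i.
    sorter = RadixSort(ctx, "int *ary", key_expr="ary[i]",
                       sort_arg_names=["ary"])

    # The call returns a tuple of sorted arrays plus an event.
    (sorted_keys,), event = sorter(keys, key_bits=32)
    event.wait()
    return sorted_keys.get()
```

If the keys are known to fit in fewer bits, passing a smaller key_bits reduces the number of passes.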
Hope you're finding this as exciting as me. :)
Andreas
6 years, 6 months