Re: [PyOpenCL] Error running "abstract" example
by Andreas Kloeckner

Dear cow-like-object,
a cow-like object <acowlikeobject(a)gmail.com> writes:
> Hi all,
>
> Very new to PyOpenCL. Trying to get my feet wet by running the examples.
>
> The following code works fine when running on Windows 8 / Intel HD 4000. I
> see a result of 0.0.
>
> [snip]
>
>
> However, when I run this "abstract" version I found in one of the
> presentations:
> [snip]
> a_doubled = (2 * a_gpu).get()
> [snip]
> C:\Python27\lib\site-packages\pyopencl\__init__.pyc in kernel_set_scalar_arg_dtypes(self, arg_dtypes)
>     464 "length of argument type array (%d) and "
>     465 "CL-generated number of arguments (%d) do not agree"
> --> 466 % (len(arg_dtypes), self.num_args))
>     467
>     468 arg_type_chars = []
>
> AssertionError: length of argument type array (5) and CL-generated number
> of arguments (6) do not agree
This looks like the Intel GPU CL implementation has a bug--it seems to
be miscounting arguments. You can verify this by inserting
    print prg.sum.num_args
in the program that works. If my guess is right, it'd be great if you
could report this to Intel, here:
http://software.intel.com/en-us/forums/intel-opencl-sdk/
As a workaround, just rip out the assert that generated the error. You
lose a sanity check, but the functionality shouldn't be affected. (Or
run with 'python -O', which just disables *all* asserts.)
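To illustrate the 'python -O' remark, here is a minimal sketch (written for Python 3, whereas the thread uses Python 2.7) that runs the same one-line script in a fresh interpreter with and without the flag. The script text and the helper name run are made up for this illustration; nothing here touches PyOpenCL.

```python
import subprocess
import sys

# Hypothetical one-line script: an assert that always fails, then a print.
SCRIPT = "assert 1 == 2, 'sanity check failed'; print('reached the end')"


def run(extra_flags):
    """Run SCRIPT in a fresh interpreter; return (exit code, stdout)."""
    proc = subprocess.run(
        [sys.executable, *extra_flags, "-c", SCRIPT],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


normal = run([])         # the assert fires and the script aborts
optimized = run(["-O"])  # assert statements are not executed under -O

print("without -O:", normal)
print("with -O:   ", optimized)
```

Note that -O applies to every assert in the process, including those in other libraries, so it is a blunter tool than removing the single check.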
Hope that helps,
Andreas
6 years, 8 months

Error running "abstract" example
by a cow-like object

Hi all,
Very new to PyOpenCL. Trying to get my feet wet by running the examples.
The following code works fine when running on Windows 8 / Intel HD 4000. I
see a result of 0.0.
import pyopencl as cl
import numpy
import numpy.linalg as la
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
    __global const float *b, __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];
    }
    """).build()

prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)

a_plus_b = numpy.empty_like(a)
cl.enqueue_copy(queue, a_plus_b, dest_buf)

print la.norm(a_plus_b - (a+b))

However, when I run this "abstract" version I found in one of the
presentations:
import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2 * a_gpu).get()
print a_doubled
print a_gpu
I get the following error:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
C:\Python27\lib\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
    169 else:
    170 filename = fname
--> 171 exec compile(scripttext, filename, 'exec') in glob, loc
    172 else:
    173 def execfile(fname, *where):

abstract.py in <module>()
      8 a_gpu = cl_array.to_device(ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
      9
---> 10 a_doubled = (2 * a_gpu).get()
     11 print a_doubled
     12 print a_gpu

C:\Python27\lib\site-packages\pyopencl\array.pyc in __rmul__(self, scalar)
    625 common_dtype = _get_common_dtype(self, scalar, self.queue)
    626 result = self._new_like_me(common_dtype)
--> 627 self._axpbz(result, common_dtype.type(scalar), self, self.dtype.type(0))
    628 return result
    629

C:\Python27\lib\site-packages\pyopencl\array.pyc in kernel_runner(*args, **kwargs)
    153 queue = kwargs.pop("queue", None) or repr_ary.queue
    154
--> 155 knl = kernel_getter(*args)
    156
    157 gs, ls = repr_ary.get_sizes(queue,

C:\Python27\lib\site-packages\pyopencl\array.pyc in _axpbz(out, a, x, b, queue)
    405 b = np.array(b)
    406 return elementwise.get_axpbz_kernel(out.context,
--> 407 a.dtype, x.dtype, b.dtype, out.dtype)
    408
    409 @staticmethod

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_axpbz_kernel(context, dtype_a, dtype_x, dtype_b, dtype_z)

C:\Python27\lib\site-packages\pyopencl\tools.pyc in first_arg_dependent_memoize(func, cl_object, *args)
     81 first_arg_dependent_memoized_functions.append(func)
     82 arg_dict = ctx_dict.setdefault(cl_object, {})
---> 83 result = func(cl_object, *args)
     84 arg_dict[args] = result
     85 return result

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_axpbz_kernel(context, dtype_a, dtype_x, dtype_b, dtype_z)
    421 },
    422 "z[i] = %s + %s" % (ax, b),
--> 423 name="axpb")
    424
    425

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_elwise_kernel(context, arguments, operation, name, options, **kwargs)
    127 func, arguments = get_elwise_kernel_and_types(
    128 context, arguments, operation,
--> 129 name=name, options=options, **kwargs)
    130
    131 return func

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_elwise_kernel_and_types(context, arguments, operation, name, options, preamble, **kwargs)
    112
    113 kernel = getattr(prg, name)
--> 114 kernel.set_scalar_arg_dtypes(scalar_arg_dtypes)
    115
    116 return kernel, parsed_args

C:\Python27\lib\site-packages\pyopencl\__init__.pyc in kernel_set_scalar_arg_dtypes(self, arg_dtypes)
    464 "length of argument type array (%d) and "
    465 "CL-generated number of arguments (%d) do not agree"
--> 466 % (len(arg_dtypes), self.num_args))
    467
    468 arg_type_chars = []

AssertionError: length of argument type array (5) and CL-generated number of arguments (6) do not agree

How would I go about troubleshooting this? Was that example even meant to
be run as-is?
Thanks very much!
6 years, 8 months

Re: [PyOpenCL] Mutiple-matrix products and two questions.
by Bogdan Opanchuk

Hi Pedro,
Unfortunately, I do not have time right now to write custom kernels,
so I'll cheat a little.
Attached is an archive with a Python script that performs the
calculation using Till's algorithm (with sizes tuned down to 80 and 50
so that my laptop can handle them). It uses reikna (see
http://reikna.publicfields.net) 0.2.4 to generate kernels, which I
included in the archive with comments in the main script about when
they are called and with which parameters. These kernels are rendered
versions of https://github.com/Manticore/reikna/blob/develop/reikna/transpose.mako
and https://github.com/Manticore/reikna/blob/develop/reikna/matrixmul.mako
(look at them if you want to know where all the magic numbers come
from) which are, in turn, just generalized transposition and dot
kernels from the NVIDIA CUDA/OpenCL SDK examples. There are some
weird-looking macros (which mostly do nothing in this case) in the rendered
versions, but I hope they are still quite readable.
Best regards,
Bogdan
On Sun, May 12, 2013 at 12:06 PM, Pedro Marcal <pedrovmarcal(a)gmail.com> wrote:
> Hi Bogdan,
> What does the .cl file look like?
> As a beginner, I would certainly appreciate being able to see a complete
> example.
> Thanks,
> Pedro
6 years, 8 months

Re: [PyOpenCL] Mutiple-matrix products and two questions.
by Pedro Marcal

Hi Bogdan,
What does the .cl file look like?
As a beginner, I would certainly appreciate being able to see a complete
example.
Thanks,
Pedro
On Sat, May 11, 2013 at 5:57 PM, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hi Till,
>
> I'd do it like this:
>
>
> out = np.empty((800, 500, 500))
>
> t = a.reshape(800, 500 * 500).T
> mat1 = t.reshape(500, 500, 800).T
> # this results in a permutation
> # mat1 = a.transpose(1, 0, 2)
>
> mat2 = mat1.T
> # this results in a permutation
> # mat2 = mat1.transpose(0, 2, 1) ( == a.transpose(1, 2, 0) )
> #
> # mat2 can be expressed in terms of "a" as well, but if transposes
> # involve actual data movement and not just a swap of strides,
> # this will be faster.
>
> out = batched_dot(mat1, mat2)
>
>
> Here batched_dot() goes over the 0-th dimension of both matrices and
> dots dimensions 1 and 2. As far as I know, numpy does not have such
> a function, but it is a simple extension of the GPU dot kernel.
>
> > if (get_global_id(0) >= max_gid) { return; }
> > Is there a better way?
>
> It is a common pattern and it does not involve any significant
> performance loss, as long as your max_gid is relatively close to the
> actual global size.
>
> Best regards,
> Bogdan
6 years, 8 months

Mutiple-matrix products and two questions.
by Till Stensitzki

Hi,
I already use some simple kernels to accelerate my data
fitting routines. One embarrassingly parallel part I am failing
to accelerate with OpenCL is the following:
a.shape is (500, 800, 10)
    out = np.empty((800, 500, 500))
    for i in range(800):
        mat = a[:, i, :]
        out[i, :, :] = np.dot(mat, mat.T)
Maybe someone can help. Note that
I could change the dimension order of a if that would make it faster.
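For reference, the loop above can be checked on the host against a single batched product. This is a minimal CPU-side sketch, assuming current NumPy, whose np.matmul treats leading dimensions as a batch; it is not the OpenCL kernel asked about, and the sizes are scaled down from (500, 800, 10) so that it runs quickly. The names out_loop, mats and out_batched are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Scaled-down stand-in for the (500, 800, 10) array in the message.
a = rng.standard_normal((50, 8, 10)).astype(np.float32)
n_rows, n_batch, _ = a.shape

# Reference: the explicit loop from the message.
out_loop = np.empty((n_batch, n_rows, n_rows), dtype=a.dtype)
for i in range(n_batch):
    mat = a[:, i, :]                      # shape (n_rows, 10)
    out_loop[i, :, :] = np.dot(mat, mat.T)

# Batched form: move the batch axis to the front, then let matmul
# multiply the last two axes for every batch entry.
mats = a.transpose(1, 0, 2)               # shape (n_batch, n_rows, 10)
out_batched = np.matmul(mats, mats.transpose(0, 2, 1))

print(out_batched.shape)
print(np.allclose(out_loop, out_batched, atol=1e-4))
```

A host-side reference like this is mainly useful for validating whatever batched kernel ends up running on the device.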
Some questions:
Some OpenCL implementations say they automatically choose an
optimal local workgroup size, but my tests show they do not.
Also, the global work size has to be a multiple of the local
workgroup size. To use a faster local workgroup size ((128, 1) in my case)
I use an additional kernel parameter max_gid and test at the beginning
of the kernel:
    if (get_global_id(0) >= max_gid) { return; }
Is there a better way?
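The padding-plus-guard pattern described above can be sketched as follows. This is an illustration only: the kernel name scale, the helper round_up and the sizes are assumptions, and the OpenCL C source appears as a plain string that is never built, so the guard is emulated on the host with an ordinary loop.

```python
import numpy as np


def round_up(n, multiple):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


# OpenCL C source for a guarded kernel (hypothetical; shown, not compiled).
GUARDED_KERNEL_SRC = """
__kernel void scale(__global float *x, const int max_gid)
{
    int gid = get_global_id(0);
    if (gid >= max_gid) { return; }   /* padding work-items do nothing */
    x[gid] = 2.0f * x[gid];
}
"""

n_items = 50000          # real problem size, passed to the kernel as max_gid
local_size = 128         # preferred local workgroup size
global_size = round_up(n_items, local_size)

# Host-side emulation of what each work-item would do.
x = np.ones(n_items, dtype=np.float32)
touched = 0
for gid in range(global_size):
    if gid >= n_items:   # same test as in the kernel source
        continue
    x[gid] = 2.0 * x[gid]
    touched += 1

print(global_size, global_size - n_items, touched)
```

The number of idle work-items is always smaller than the local size, which is why the guard costs little when max_gid is close to the global size.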
greetings and thanks for the nice package!
Till Stensitzki
6 years, 8 months

Re: [PyOpenCL] batch enqueue
by Andreas Kloeckner

Hi James,
James Bergstra <james.bergstra(a)gmail.com> writes:
> Hi, I have written an opencl program that involves relatively small
> kernels. For a certain benchmarking script, I have added up the time used
> by kernels as 0.06 seconds, while the tightest python loop I can think of
> still requires .2 seconds to execute the 5000-or-so kernel calls. The
> program involves repeatedly looping through the same kernels, with the same
> arguments, so I was wondering if there was a way to enqueue several nd
> range kernels at once, at least from Python's perspective. Is there such a
> thing?
>
> In other words, supposing I have kernels A and B, taking arguments x and y,
> my program consists of:
> A(x); B(y); A(x); B(y); ....
>
> Ideally, I would like to enqueue 100 copies of the kernel sequence [(A, x),
> (B, y)], but being able to enqueue even [(A, x), (B, y)] with one call
> instead of 2 could be a big help.
What you're saying is that Kernel.__call__ is too slow for your current
purposes, correct?
First off, it'd be great if you could take a look at Kernel.set_args:
https://github.com/inducer/pyopencl/blob/master/pyopencl/__init__.py#L559
and Kernel.__call__:
https://github.com/inducer/pyopencl/blob/master/pyopencl/__init__.py#L528
to see if there's any fat that could be trimmed from your
perspective. I've tried to keep this code path as quick as I could, but
there might be something I've overlooked.
Next, if there's nothing to be had in that direction, you can simply
call Kernel.set_args once and then repeatedly call
cl.enqueue_nd_range_kernel() as done in Kernel.__call__ (see source link
above). That should get reasonably close to the rate that the OpenCL API
itself can sustain.
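The second suggestion could look roughly like the sketch below. The function enqueue_repeated and its steps format are hypothetical; only Kernel.set_args and cl.enqueue_nd_range_kernel are PyOpenCL calls. So that the sketch can run without an OpenCL device, the enqueue function is injectable, and the demonstration uses stand-in objects (FakeKernel, fake_enqueue) instead of real kernels.

```python
def enqueue_repeated(queue, steps, repeats, enqueue=None):
    """Enqueue the kernel sequence in `steps` a total of `repeats` times.

    steps:   list of (kernel, args, global_size, local_size) tuples.
    enqueue: callable used to launch a kernel; defaults to
             pyopencl.enqueue_nd_range_kernel.
    """
    if enqueue is None:
        import pyopencl as cl  # imported lazily; needs PyOpenCL installed
        enqueue = cl.enqueue_nd_range_kernel

    # Bind the arguments once; they stay set on each kernel object.
    for kernel, args, _, _ in steps:
        kernel.set_args(*args)

    event = None
    for _ in range(repeats):
        for kernel, _, global_size, local_size in steps:
            event = enqueue(queue, kernel, global_size, local_size)
    return event


# Stand-ins for demonstration only; these are not PyOpenCL classes.
class FakeKernel:
    def __init__(self, name):
        self.name = name
        self.set_args_calls = 0

    def set_args(self, *args):
        self.set_args_calls += 1


log = []


def fake_enqueue(queue, kernel, global_size, local_size):
    log.append(kernel.name)
    return len(log)


A, B = FakeKernel("A"), FakeKernel("B")
steps = [(A, ("x",), (1024,), None), (B, ("y",), (1024,), None)]
last = enqueue_repeated(None, steps, repeats=3, enqueue=fake_enqueue)
print(log, A.set_args_calls, B.set_args_calls)
```

This relies on A and B being distinct kernel objects. If the same kernel object were used with different arguments, the arguments would have to be set again before each launch.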
Hope that helps,
Andreas
6 years, 8 months

batch enqueue
by James Bergstra

Hi, I have written an opencl program that involves relatively small
kernels. For a certain benchmarking script, I have added up the time used
by kernels as 0.06 seconds, while the tightest python loop I can think of
still requires .2 seconds to execute the 5000-or-so kernel calls. The
program involves repeatedly looping through the same kernels, with the same
arguments, so I was wondering if there was a way to enqueue several nd
range kernels at once, at least from Python's perspective. Is there such a
thing?
In other words, supposing I have kernels A and B, taking arguments x and y,
my program consists of:
A(x); B(y); A(x); B(y); ....
Ideally, I would like to enqueue 100 copies of the kernel sequence [(A, x),
(B, y)], but being able to enqueue even [(A, x), (B, y)] with one call
instead of 2 could be a big help.
- James
6 years, 8 months

RadixSort shape and dtype
by Dieter Morgenroth

Hello,
I have a float array of shape (150L, 4L) and dtype float32 and an
integer array.
I am sorting the float array by the integer array with the built-in
RadixSort. Then I am getting the float array back.
But after the sort the shape of the float array is (150L,)
and type is [(('x', 's0'), '<f4'), (('y', 's1'), '<f4'), (('z', 's2'),
'<f4'), (('w', 's3'), '<f4')]
What is the best way to convert the numpy array back to the initial
shape and type?
Any ideas?
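One possible approach, sketched with plain NumPy: since the reported dtype is four packed float32 fields, the sorted result can be reinterpreted as float32 and reshaped. The round trip below builds a structured array with the dtype quoted above instead of calling RadixSort, and assumes the array is contiguous with no padding between fields.

```python
import numpy as np

# The dtype reported after the sort: four float32 fields with titles.
float4 = np.dtype([(("x", "s0"), "<f4"), (("y", "s1"), "<f4"),
                   (("z", "s2"), "<f4"), (("w", "s3"), "<f4")])

# Stand-in for the original (150, 4) float32 input.
original = np.arange(150 * 4, dtype=np.float32).reshape(150, 4)

# What comes back from the sort has this form: shape (150,), one record per row.
structured = original.view(float4).reshape(-1)

# Reinterpret the same bytes as float32 and restore the second axis.
restored = structured.view(np.float32).reshape(-1, 4)

print(structured.shape, restored.shape, restored.dtype)
```

Both view calls reinterpret the existing buffer, so no data is copied.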
Best regards,
Dieter
6 years, 8 months

__getitem__ for pyopencl array
by Alex Nitz

Hello,
I am mostly a pycuda user, but am investigating trying to use some of my
codes with pyopencl. My codes make heavy use of the numpy-like array. I
noticed that there doesn't seem to be a "__getitem__" function
defined yet, although the buffer objects themselves have one.
My needs are basically met by the version that is in pycuda, so I have
created a short patch to add the same behavior to pyopencl. It is fairly
limited in that it only supports 1-dimensional, non-strided slices. Is a
more comprehensive functionality already in the works? If not, would it be
possible to get this patch applied?
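To make the stated limitation concrete, here is a toy sketch of a __getitem__ that accepts only 1-dimensional, non-strided slices. It is not the patch and not PyOpenCL code: Array1D and contiguous_bounds are hypothetical names, and a Python list stands in for the device buffer.

```python
def contiguous_bounds(index, length):
    """Translate a 1-D slice into (start, stop); reject strided slices."""
    if not isinstance(index, slice):
        raise TypeError("only slices are supported in this sketch")
    start, stop, step = index.indices(length)
    if step != 1:
        raise NotImplementedError("strided slices are not supported")
    return start, max(start, stop)


class Array1D:
    """Toy stand-in for a device array: a shared buffer plus a window."""

    def __init__(self, data, start=0, stop=None):
        self._data = data
        self._start = start
        self._stop = len(data) if stop is None else stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, index):
        start, stop = contiguous_bounds(index, len(self))
        # The result shares the buffer; only the window changes.
        return Array1D(self._data, self._start + start, self._start + stop)

    def tolist(self):
        return list(self._data[self._start:self._stop])


buf = list(range(10))
window = Array1D(buf)[2:8][1:-1]
print(window.tolist())
```

Because a contiguous slice is just an offset and a length into the same buffer, it needs no data movement, which is what makes this restricted case simple to support.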
Thanks,
Alex
6 years, 8 months