Re: [PyOpenCL] Error running "abstract" example
by Andreas Kloeckner

Dear cow-like-object,
a cow-like object <acowlikeobject(a)gmail.com> writes:
> Hi all,
>
> Very new to PyOpenCL. Trying to get my feet wet by running the examples.
>
> The following code works fine when running on Windows 8 / Intel HD 4000. I
> see a result of 0.0.
>
> [snip]
>
>
> However, when I run this "abstract" version I found in one of the
> presentations:
> [snip]
> a_doubled = (2 * a_gpu).get()
> [snip]
> C:\Python27\lib\site-packages\pyopencl\__init__.pyc in kernel_set_scalar_arg_dtypes(self, arg_dtypes)
>     464                 "length of argument type array (%d) and "
>     465                 "CL-generated number of arguments (%d) do not agree"
> --> 466                 % (len(arg_dtypes), self.num_args))
>     467
>     468         arg_type_chars = []
>
> AssertionError: length of argument type array (5) and CL-generated number of arguments (6) do not agree
This looks like a bug in the Intel GPU CL implementation: it seems to
be miscounting arguments. You can verify this by inserting

    print knl.sum.num_args

in the program that works. If my guess is right, it'd be great if you
could report this to Intel, here:
http://software.intel.com/en-us/forums/intel-opencl-sdk/
As a workaround, just rip out the assert that generated the error. You
lose a sanity check, but the functionality shouldn't be affected. (Or
run with 'python -O', which just disables *all* asserts.)
Hope that helps,
Andreas

Error running "abstract" example
by a cow-like object

Hi all,
Very new to PyOpenCL. Trying to get my feet wet by running the examples.
The following code works fine when running on Windows 8 / Intel HD 4000. I
see a result of 0.0.
import pyopencl as cl
import numpy
import numpy.linalg as la

a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
                      __global const float *b, __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];
    }
    """).build()

prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)

a_plus_b = numpy.empty_like(a)
cl.enqueue_copy(queue, a_plus_b, dest_buf)

print la.norm(a_plus_b - (a+b))
However, when I run this "abstract" version I found in one of the
presentations:
import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2 * a_gpu).get()
print a_doubled
print a_gpu
I get the following error:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

C:\Python27\lib\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
    169             else:
    170                 filename = fname
--> 171             exec compile(scripttext, filename, 'exec') in glob, loc
    172     else:
    173         def execfile(fname, *where):

abstract.py in <module>()
      8 a_gpu = cl_array.to_device(ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
      9
---> 10 a_doubled = (2 * a_gpu).get()
     11 print a_doubled
     12 print a_gpu

C:\Python27\lib\site-packages\pyopencl\array.pyc in __rmul__(self, scalar)
    625         common_dtype = _get_common_dtype(self, scalar, self.queue)
    626         result = self._new_like_me(common_dtype)
--> 627         self._axpbz(result, common_dtype.type(scalar), self, self.dtype.type(0))
    628         return result
    629

C:\Python27\lib\site-packages\pyopencl\array.pyc in kernel_runner(*args, **kwargs)
    153         queue = kwargs.pop("queue", None) or repr_ary.queue
    154
--> 155         knl = kernel_getter(*args)
    156
    157         gs, ls = repr_ary.get_sizes(queue,

C:\Python27\lib\site-packages\pyopencl\array.pyc in _axpbz(out, a, x, b, queue)
    405         b = np.array(b)
    406         return elementwise.get_axpbz_kernel(out.context,
--> 407                 a.dtype, x.dtype, b.dtype, out.dtype)
    408
    409     @staticmethod

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_axpbz_kernel(context, dtype_a, dtype_x, dtype_b, dtype_z)

C:\Python27\lib\site-packages\pyopencl\tools.pyc in first_arg_dependent_memoize(func, cl_object, *args)
     81         first_arg_dependent_memoized_functions.append(func)
     82     arg_dict = ctx_dict.setdefault(cl_object, {})
---> 83     result = func(cl_object, *args)
     84     arg_dict[args] = result
     85     return result

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_axpbz_kernel(context, dtype_a, dtype_x, dtype_b, dtype_z)
    421             },
    422             "z[i] = %s + %s" % (ax, b),
--> 423             name="axpb")
    424
    425

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_elwise_kernel(context, arguments, operation, name, options, **kwargs)
    127     func, arguments = get_elwise_kernel_and_types(
    128         context, arguments, operation,
--> 129         name=name, options=options, **kwargs)
    130
    131     return func

C:\Python27\lib\site-packages\pyopencl\elementwise.pyc in get_elwise_kernel_and_types(context, arguments, operation, name, options, preamble, **kwargs)
    112
    113     kernel = getattr(prg, name)
--> 114     kernel.set_scalar_arg_dtypes(scalar_arg_dtypes)
    115
    116     return kernel, parsed_args

C:\Python27\lib\site-packages\pyopencl\__init__.pyc in kernel_set_scalar_arg_dtypes(self, arg_dtypes)
    464                 "length of argument type array (%d) and "
    465                 "CL-generated number of arguments (%d) do not agree"
--> 466                 % (len(arg_dtypes), self.num_args))
    467
    468         arg_type_chars = []

AssertionError: length of argument type array (5) and CL-generated number of arguments (6) do not agree
*
*
How would I go about troubleshooting this? Was that example even meant to
be run as-is?
Thanks very much!

Re: [PyOpenCL] Multiple-matrix products and two questions.
by Bogdan Opanchuk

Hi Pedro,
Unfortunately, I do not have time right now to write custom kernels,
so I'll cheat a little.
Attached is an archive with a Python script that performs the
calculation using Till's algorithm (with sizes tuned down to 80 and 50
so that my laptop can handle them). It uses reikna (see
http://reikna.publicfields.net) 0.2.4 to generate kernels, which I
included in the archive with comments in the main script about when
they are called and with which parameters. These kernels are rendered
versions of https://github.com/Manticore/reikna/blob/develop/reikna/transpose.mako
and https://github.com/Manticore/reikna/blob/develop/reikna/matrixmul.mako
(look at them if you want to know where all the magic numbers come
from) which are, in turn, just generalized transposition and dot
kernels from the NVIDIA CUDA/OpenCL SDK examples. There are some
weird-looking macros (which mostly do nothing in this case) in the
rendered versions, but I hope they are still quite readable.
Best regards,
Bogdan
On Sun, May 12, 2013 at 12:06 PM, Pedro Marcal <pedrovmarcal(a)gmail.com> wrote:
> Hi Bogdan,
> What does the .cl file look like?
> As a beginner, I would certainly appreciate being able to see a complete
> example.
> Thanks,
> Pedro
>
>
> On Sat, May 11, 2013 at 5:57 PM, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
>>
>> Hi Till,
>>
>> I'd do it like this:
>>
>>
>> out = np.empty((800, 500, 500))
>>
>> t = a.reshape(800, 500 * 500).T
>> mat1 = t.reshape(500, 500, 800).T
>> # this results in a permutation
>> # mat1 = a.transpose(1, 0, 2)
>>
>> mat2 = mat1.T
>> # this results in a permutation
>> # mat2 = mat1.transpose(0, 2, 1) ( == a.transpose(1, 2, 0) )
>> #
>> # mat2 can be expressed in terms of "a" as well, but if transposes
>> # involve actual data movement and not just a strides swap, this
>> # will be faster.
>>
>> out = batched_dot(mat1, mat2)
>>
>>
>> Here batched_dot() goes over the 0-th dimension of both matrices and
>> dots dimensions 1 and 2. As far as I know, numpy does not have such
>> function, but it is a simple extension of the GPU dot kernel.
>>
>> > if (get_global_id(0) >= max_gid) return;
>> > Is there a better way?
>>
>> It is a common pattern and it does not involve any significant
>> performance loss, as long as your max_gid is relatively close to the
>> actual global size.
>>
>> Best regards,
>> Bogdan
>>
>> On Sun, May 12, 2013 at 7:01 AM, Till Stensitzki
>> <tillsten(a)zedat.fu-berlin.de> wrote:
>> > Hi,
>> > I already use some simple kernels to accelerate my data
>> > fitting routines. One embarrassingly parallel part I am failing
>> > to accelerate with OpenCL is the following:
>> >
>> > a.shape is (500, 800, 10)
>> >
>> > out = np.empty((800, 500, 500))
>> > for i in range(800):
>> >     mat = a[:, i, :]
>> >     out[i, :, :] = np.dot(mat, mat.T)
>> >
>> > Maybe anyone can help. Note that
>> > I could change the dim order of a if that would make it faster.
>> >
>> > Some questions:
>> > Some OpenCL documentation says it would automatically set an optimal
>> > local workgroup size, but my tests show it does not.
>> >
>> > Also the global workgroup size has to be a multiple of the local
>> > workgroup size. To use a faster local workgroup size ((128, 1) in my
>> > case)
>> > I use an additional kernel parameter max_gid and test at the beginning
>> > of the kernel
>> >
>> > if (get_global_id(0) >= max_gid) return;
>> >
>> > Is there a better way?
>> >
>> >
>> > greetings and thanks for the nice package!
>> > Till Stensitzki
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > PyOpenCL mailing list
>> > PyOpenCL(a)tiker.net
>> > http://lists.tiker.net/listinfo/pyopencl
>>

Re: [PyOpenCL] Multiple-matrix products and two questions.
by Pedro Marcal

Hi Bogdan,
What does the .cl file look like?
As a beginner, I would certainly appreciate being able to see a complete
example.
Thanks,
Pedro
On Sat, May 11, 2013 at 5:57 PM, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hi Till,
>
> I'd do it like this:
>
>
> out = np.empty((800, 500, 500))
>
> t = a.reshape(800, 500 * 500).T
> mat1 = t.reshape(500, 500, 800).T
> # this results in a permutation
> # mat1 = a.transpose(1, 0, 2)
>
> mat2 = mat1.T
> # this results in a permutation
> # mat2 = mat1.transpose(0, 2, 1) ( == a.transpose(1, 2, 0) )
> #
> # mat2 can be expressed in terms of "a" as well, but if transposes
> # involve actual data movement and not just a strides swap, this
> # will be faster.
>
> out = batched_dot(mat1, mat2)
>
>
> Here batched_dot() goes over the 0-th dimension of both matrices and
> dots dimensions 1 and 2. As far as I know, numpy does not have such
> function, but it is a simple extension of the GPU dot kernel.
>
> > if (get_global_id(0) >= max_gid) return;
> > Is there a better way?
>
> It is a common pattern and it does not involve any significant
> performance loss, as long as your max_gid is relatively close to the
> actual global size.
>
> Best regards,
> Bogdan
>
> On Sun, May 12, 2013 at 7:01 AM, Till Stensitzki
> <tillsten(a)zedat.fu-berlin.de> wrote:
> > Hi,
> > I already use some simple kernels to accelerate my data
> > fitting routines. One embarrassingly parallel part I am failing
> > to accelerate with OpenCL is the following:
> >
> > a.shape is (500, 800, 10)
> >
> > out = np.empty((800, 500, 500))
> > for i in range(800):
> >     mat = a[:, i, :]
> >     out[i, :, :] = np.dot(mat, mat.T)
> >
> > Maybe anyone can help. Note that
> > I could change the dim order of a if that would make it faster.
> >
> > Some questions:
> > Some OpenCL documentation says it would automatically set an optimal
> > local workgroup size, but my tests show it does not.
> >
> > Also the global workgroup size has to be a multiple of the local
> > workgroup size. To use a faster local workgroup size ((128, 1) in my
> > case)
> > I use an additional kernel parameter max_gid and test at the beginning
> > of the kernel
> >
> > if (get_global_id(0) >= max_gid) return;
> >
> > Is there a better way?
> >
> >
> > greetings and thanks for the nice package!
> > Till Stensitzki
> >
> >
> >
> >
> >

Multiple-matrix products and two questions.
by Till Stensitzki

Hi,
I already use some simple kernels to accelerate my data
fitting routines. One embarrassingly parallel part I am failing
to accelerate with OpenCL is the following:
a.shape is (500, 800, 10)
out = np.empty((800, 500, 500))
for i in range(800):
    mat = a[:, i, :]
    out[i, :, :] = np.dot(mat, mat.T)
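For reference, the loop above can also be written as one batched product on the CPU with numpy's einsum. This is only a host-side sketch (no OpenCL involved, and the sizes are shrunk so it runs quickly); it is handy as a reference to check a GPU kernel against:

```python
import numpy as np

# Small stand-in sizes; the original uses a.shape == (500, 800, 10).
a = np.random.rand(50, 80, 10).astype(np.float32)

# Loop version, as in the message above.
out_loop = np.empty((80, 50, 50), dtype=np.float32)
for i in range(80):
    mat = a[:, i, :]                      # shape (50, 10)
    out_loop[i, :, :] = np.dot(mat, mat.T)

# Same computation in a single batched call:
# out[i, j, k] = sum_m a[j, i, m] * a[k, i, m]
out_batched = np.einsum('jim,kim->ijk', a, a)

assert np.allclose(out_loop, out_batched, atol=1e-4)
```

The einsum form also makes explicit which axis is the batch axis (the middle one here), which is the dimension a GPU batched-dot kernel would parallelize over.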
Maybe anyone can help. Note that
I could change the dim order of a if that would make it faster.
Some questions:
Some OpenCL documentation says it would automatically set an optimal
local workgroup size, but my tests show it does not.
Also the global workgroup size has to be a multiple of the local
workgroup size. To use a faster local workgroup size ((128, 1) in my case)
I use an additional kernel parameter max_gid and test at the beginning
of the kernel:

if (get_global_id(0) >= max_gid) return;

Is there a better way?
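The usual host-side counterpart to an in-kernel bounds test is to round the global size up to the next multiple of the chosen local size before enqueueing; the rounding itself is plain integer arithmetic, sketched here (not PyOpenCL-specific):

```python
def round_up(global_size, local_size):
    # Smallest multiple of local_size that is >= global_size.
    return ((global_size + local_size - 1) // local_size) * local_size

# 100000 work items with a work-group size of 128 get padded to 100096;
# the extra 96 work items are the ones the in-kernel bounds test skips.
assert round_up(100000, 128) == 100096
assert round_up(128, 128) == 128
```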
greetings and thanks for the nice package!
Till Stensitzki

Re: [PyOpenCL] batch enqueue
by Andreas Kloeckner

Hi James,
James Bergstra <james.bergstra(a)gmail.com> writes:
> Hi, I have written an opencl program that involves relatively small
> kernels. For a certain benchmarking script, I have added up the time used
> by kernels as 0.06 seconds, while the tightest python loop I can think of
> still requires 0.2 seconds to execute the 5000-or-so kernel calls. The
> program involves repeatedly looping through the same kernels, with the same
> arguments, so I was wondering if there was a way to enqueue several nd
> range kernels at once, at least from Python's perspective. Is there such a
> thing?
>
> In other words, supposing I have kernels A and B, taking arguments x and y,
> my program consists of:
> A(x); B(y); A(x); B(y); ....
>
> Ideally, I would like to enqueue 100 copies of the kernel sequence [(A, x),
> (B, y)], but being able to enqueue even [(A, x), (B, y)] with one call
> instead of 2 could be a big help.
What you're saying is that Kernel.__call__ is too slow for your current
purposes, correct?
First off, it'd be great if you could take a look at Kernel.set_args:
https://github.com/inducer/pyopencl/blob/master/pyopencl/__init__.py#L559
and Kernel.__call__:
https://github.com/inducer/pyopencl/blob/master/pyopencl/__init__.py#L528
to see if there's any fat that could be trimmed from your
perspective. I've tried to keep this code path as quick as I could, but
there might be something I've overlooked.
Next, if there's nothing to be had in that direction, you can simply
call Kernel.set_args once and then repeatedly call
cl.enqueue_nd_range_kernel() as done in Kernel.__call__ (see source link
above). That should get reasonably close to the rate that the OpenCL API
itself can sustain.
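In outline, that second approach might look like the following. The kernel names, bodies, and sizes here are made up for illustration, and error handling is omitted; it needs an OpenCL device to actually run:

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Two tiny illustrative kernels standing in for A and B.
prg = cl.Program(ctx, """
    __kernel void A(__global float *x) { x[get_global_id(0)] += 1.0f; }
    __kernel void B(__global float *y) { y[get_global_id(0)] *= 2.0f; }
    """).build()

n = 1024
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR,
                  hostbuf=np.zeros(n, dtype=np.float32))
y_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR,
                  hostbuf=np.ones(n, dtype=np.float32))

knl_a = prg.A
knl_b = prg.B

# Set arguments once, outside the loop...
knl_a.set_args(x_buf)
knl_b.set_args(y_buf)

# ...then enqueue repeatedly, skipping Kernel.__call__'s per-call
# argument processing.
for _ in range(100):
    cl.enqueue_nd_range_kernel(queue, knl_a, (n,), None)
    cl.enqueue_nd_range_kernel(queue, knl_b, (n,), None)

queue.finish()
```

Each pair of enqueue calls replaces what would otherwise be two Kernel.__call__ invocations, which is where the per-call Python overhead lives.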
Hope that helps,
Andreas

batch enqueue
by James Bergstra

Hi, I have written an opencl program that involves relatively small
kernels. For a certain benchmarking script, I have added up the time used
by kernels as 0.06 seconds, while the tightest python loop I can think of
still requires 0.2 seconds to execute the 5000-or-so kernel calls. The
program involves repeatedly looping through the same kernels, with the same
arguments, so I was wondering if there was a way to enqueue several nd
range kernels at once, at least from Python's perspective. Is there such a
thing?
In other words, supposing I have kernels A and B, taking arguments x and y,
my program consists of:
A(x); B(y); A(x); B(y); ....
Ideally, I would like to enqueue 100 copies of the kernel sequence [(A, x),
(B, y)], but being able to enqueue even [(A, x), (B, y)] with one call
instead of 2 could be a big help.
- James

RadixSort shape and dtype
by Dieter Morgenroth

Hello,
I have a float array of shape (150L, 4L) and dtype float32 and an
integer array.
I am sorting the float array by the integer array with the built-in
RadixSort. Then I am getting the float array back.
But after the sort the shape of the float array is (150L,)
and type is [(('x', 's0'), '<f4'), (('y', 's1'), '<f4'), (('z', 's2'),
'<f4'), (('w', 's3'), '<f4')]
What is the best way to convert the numpy array back to the initial
shape and type?
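To make the shapes concrete, here is a plain-numpy sketch with a made-up stand-in array (simplified field names, no OpenCL) showing one possible conversion back, a float32 view plus a reshape; whether this is the best way is exactly my question:

```python
import numpy as np

# Stand-in for the sorted result: shape (150,), four float32 fields,
# like what comes back from the sort.
structured_dtype = np.dtype([('x', '<f4'), ('y', '<f4'),
                             ('z', '<f4'), ('w', '<f4')])
sorted_arr = np.zeros(150, dtype=structured_dtype)

# Reinterpret the same memory as plain float32 and restore (150, 4).
restored = sorted_arr.view(np.float32).reshape(-1, 4)

assert restored.shape == (150, 4)
assert restored.dtype == np.float32
```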
Any ideas?
Best regards,
Dieter

__getitem__ for pyopencl array
by Alex Nitz

Hello,
I am mostly a pycuda user, but am investigating trying to use some of my
codes with pyopencl. My codes make heavy use of the numpy-like array. I
noticed that there doesn't seem to be a "__getitem__" function defined
yet, although the buffer objects themselves have one.
My needs are basically met by the version that is in pycuda, so I have
created a short patch to add the same behavior to pyopencl. It is fairly
limited in that it only supports 1-dimensional, non-strided slices. Is a
more comprehensive functionality already in the works? If not, would it be
possible to get this patch applied?
Thanks,
Alex