We are trying to implement a k-nearest-neighbor search on GPUs with
PyOpenCL. The goal of the algorithm is: for a given target point,
find the k nearest points from a given set (training data). The distance
between two points is computed as the squared Euclidean distance.
One of our implementations is a brute-force approach, which aims
at processing big data sets in parallel, e.g. 1 million training points and
some millions of targets (test data). For every target point one kernel
instance is created which finds the k nearest points out of the training data.
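To make the setup concrete, here is a stripped-down sketch of what such a
brute-force kernel and its host code could look like. This is only an
illustration, not the attached code: the kernel name find_knn is the one from
our real code, but its argument list and all other identifiers (n_train,
n_test, dim, k, the buffer names) are made up for this sketch.

import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void find_knn(__global const float *train, __global const float *test,
                       __global float *best_dist, __global int *best_idx,
                       int n_train, int n_test, int dim, int k)
{
    int gid = get_global_id(0);               /* one work item per test point */
    if (gid >= n_test)
        return;

    for (int j = 0; j < k; ++j) {             /* start "infinitely far away"  */
        best_dist[gid * k + j] = INFINITY;
        best_idx[gid * k + j] = -1;
    }

    for (int i = 0; i < n_train; ++i) {
        float d = 0.0f;                       /* squared Euclidean distance   */
        for (int c = 0; c < dim; ++c) {
            float diff = test[gid * dim + c] - train[i * dim + c];
            d += diff * diff;
        }
        if (d < best_dist[gid * k + k - 1]) { /* insert into sorted top-k     */
            int j = k - 1;
            while (j > 0 && best_dist[gid * k + j - 1] > d) {
                best_dist[gid * k + j] = best_dist[gid * k + j - 1];
                best_idx[gid * k + j]  = best_idx[gid * k + j - 1];
                --j;
            }
            best_dist[gid * k + j] = d;
            best_idx[gid * k + j]  = i;
        }
    }
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, KERNEL_SRC).build()

n_train, n_test, dim, k = 100000, 10000, 3, 5
train = np.random.rand(n_train, dim).astype(np.float32)
test  = np.random.rand(n_test, dim).astype(np.float32)
best_dist = np.empty((n_test, k), dtype=np.float32)
best_idx  = np.empty((n_test, k), dtype=np.int32)

mf = cl.mem_flags
d_train = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=train)
d_test  = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=test)
d_dist  = cl.Buffer(ctx, mf.WRITE_ONLY, best_dist.nbytes)
d_idx   = cl.Buffer(ctx, mf.WRITE_ONLY, best_idx.nbytes)

prg.find_knn(queue, (n_test,), None, d_train, d_test, d_dist, d_idx,
             np.int32(n_train), np.int32(n_test), np.int32(dim), np.int32(k))
cl.enqueue_copy(queue, best_dist, d_dist)
cl.enqueue_copy(queue, best_idx, d_idx).wait()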
Our problem is the following. Everything works fine for small data sets
and the results are as expected on both the GPU (GeForce GTX 650 with
NVIDIA driver 313.09) and the CPU (Intel Core i5-3450 with AMD APP SDK),
running Ubuntu 12.10 and PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a working minimal example, which consists of one Python file
and one OpenCL kernel file.
We would highly appreciate any help - maybe we made a
mistake which is already known to you.
So the big question for us is: why is it working on the CPU and why isn't it
working on the GPU?
Are there nVidia-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem, 4 bytes cmem
Or are there any rules for using a kernel on big data sets, such as setting
the work group sizes or the maximum memory usage?
The error message "invalid command queue" is confusing and I wasn't able
to find any helpful information (except that oftentimes "invalid command
queue" means a segfault, but I could not find any wrong array address yet).
Maybe one of you could have a look at our code and find some stupid
mistake we made. We would be very grateful for every hint.
Up - any news on this item?
I went into the source code of PIL.Image and figured out that
Image.tobytes() doesn't simply return a pointer but builds the buffer
on the fly.
So I wonder if using a PIL image directly as a hostbuf is possible at all
(without doing horrible hacks on undocumented implementation details)?
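The workaround I'm looking at for now (only a sketch; it goes through numpy
and COPY_HOST_PTR instead of handing the PIL object to USE_HOST_PTR, so the
pixels are copied once) is to materialise the image in a numpy array first,
so that hostbuf is a real, stable buffer:

import numpy as np
import pyopencl as cl
from PIL import Image

src = Image.open("test.jpg").convert('RGBA')
src_arr = np.array(src)   # contiguous uint8 array, shape (height, width, 4)

ctx = cl.create_some_context()
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNORM_INT8)
src_buf = cl.Image(ctx,
                   flags=cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR,
                   format=fmt,
                   shape=(src.size[0], src.size[1]),   # (width, height)
                   hostbuf=src_arr)

That avoids relying on how tobytes() builds its buffer, at the cost of one
extra host-side copy.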
On 14 May 2014 16:30, CRV§ADER//KY <crusaderky(a)gmail.com> wrote:
> Hi Andreas,
> sorry to pester - any news on this?
> On 14 April 2014 08:25, CRV§ADER//KY <crusaderky(a)gmail.com> wrote:
>> import pyopencl as cl
>> from PIL import Image
>> src = Image.open("test.jpg").convert('RGBA')
>> ctx = cl.create_some_context()
>> fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNORM_INT8)
>> src_buf = cl.Image(ctx,
>> flags=cl.mem_flags.READ_ONLY | cl.mem_flags.USE_HOST_PTR,
>> On 14 April 2014 07:34, CRV§ADER//KY <crusaderky(a)gmail.com> wrote:
>>> It is in one of the messages above already
>>> On 14 Apr 2014 01:55, "Andreas Kloeckner" <lists(a)informa.tiker.net>
>>>> "CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
>>>> > sorry for the long wait.
>>>> > ...nope, still doesn't work; same error as before.
>>>> Can you send some self-contained code to reproduce this? That would help
>>>> me get this working.