Dear all,
we are trying to implement a k-nearest-neighbor search on GPUs with
PyOpenCL. The goal of the algorithm is: for a given target point,
find the k nearest points from a given set (training data). The distance
between two points is computed as the squared Euclidean distance.
One of our implementations is a brute-force approach, which aims
at processing big data sets in parallel, e.g. 1 million training points and
several million targets (test data). For every target point, one kernel
instance is created which finds the k nearest points among the
training points.
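For clarity, here is a host-side NumPy reference of what each kernel instance is meant to compute (our own sketch, not the attached code):

```python
import numpy as np

def knn_brute_force(train, targets, k):
    """For each target point, return the indices of the k nearest
    training points under the squared Euclidean distance."""
    # Squared distances only; sqrt is omitted since it does not change the ranking.
    d2 = ((targets[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    # argpartition selects the k smallest per row without a full sort.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    # Order the k winners by their actual distance.
    rows = np.arange(targets.shape[0])[:, None]
    order = np.argsort(d2[rows, idx], axis=1)
    return idx[rows, order]
```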
Our problem is the following: everything works fine for small data sets,
and the results are as expected on both the GPU (GeForce GTX 650 with
nVidia driver 313.09) and the CPU (Intel Core i5-3450 with AMD APP SDK),
running Ubuntu 12.10 and PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> pyopencl-2013.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py",
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a working minimal example, which consists of one
host Python file and one OpenCL kernel file.
We would highly appreciate any help; maybe we made a
mistake that is already known to you.
So the big question for us is: why does it work on the CPU, and why
doesn't it work on the GPU?
Are there nVidia-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem[0], 4 bytes cmem[3]
Or are there any rules for running a kernel on big data sets, such as setting
the work-group sizes or a maximum memory usage?
The error message "invalid command queue" is confusing, and I wasn't able
to find any helpful information (except that oftentimes "invalid command
queue" means a segfault, but I could not find any wrong array address yet).
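For what it's worth, on nVidia hardware "invalid command queue" is often the delayed report of an earlier failure: an out-of-bounds access inside the kernel, or the display watchdog killing a kernel that runs longer than a few seconds, after which every later call on that queue fails. One workaround that might apply here is to split the target set into chunks and finish the queue between launches, so no single launch runs too long. A sketch (the chunking helper and the offset argument are our own assumptions, not taken from the attached code):

```python
def chunk_ranges(n_total, chunk_size):
    """Yield (offset, size) pairs that cover [0, n_total) in pieces."""
    for off in range(0, n_total, chunk_size):
        yield off, min(chunk_size, n_total - off)

# Hypothetical PyOpenCL usage, assuming the kernel accepts a chunk offset:
# for off, size in chunk_ranges(n_targets, 65536):
#     prg.find_knn(queue, (size,), None, ..., numpy.int32(off))
#     queue.finish()   # bound the runtime of each launch
```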
Maybe one of you could have a look at our code and spot some
mistake we overlooked.
We would be very grateful for every hint.
Best regards,
Justin Heinermann,
University Oldenburg
"CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
> src = Image.open("test.jpg")
> src.load()
> src = src.convert('RGBA')
>
> src_buf = cl.Image(self.ctx,
> flags=mf.READ_ONLY | mf.USE_HOST_PTR,
> format=fmt,
> shape=src.size,
> hostbuf=src
> )
>
> TypeError: expected an object with a writable buffer interface
Can you try the latest git version? I believe this was fixed there a
while ago.
Andreas
"CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
> Hi,
> how do I write a kernel that reads from a PIL.Image and writes to
> another, without using any unnecessary malloc and memcpy operations?
>
> So far I could only implement it like this:
>
> SHAPE = (16,16)
> def some_filter(src):
> fmt = cl.ImageFormat(cl.channel_order.RGBA,
> cl.channel_type.UNORM_INT8)
> src_np = numpy.fromstring(src.tostring(),
> dtype=numpy.uint32).reshape(src.size)
> dst_np = numpy.empty(src.size, dtype=numpy.uint32)
>
> src_buf = cl.Image(self.ctx, flags=mf.READ_ONLY, format=fmt,
> shape=src.size)
> dst_buf = cl.Image(self.ctx, flags=mf.WRITE_ONLY, format=fmt,
> shape=src.size)
> cl.enqueue_copy(self.queue, src_buf, src_np, origin=(0,0),
> region=src.size)
> self.prg.somefilter(self.queue, src.size, SHAPE, src_buf, dst_buf)
> cl.enqueue_copy(self.queue, dst_np, dst_buf, origin=(0,0),
> region=src.size)
>
> dst = Image.frombytes('RGBA', src.size, dst_np)
> return dst
>
> Kernel code:
>
> __constant sampler_t SAMPLER = CLK_NORMALIZED_COORDS_FALSE |
> CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
> kernel void somefilter(
> __read_only image2d_t srcImage,
> __write_only image2d_t dstImage
> ) {
> const int2 pos = {get_global_id(0), get_global_id(1)};
> float4 pixel = read_imagef(srcImage, SAMPLER, pos);
> //omissis: actual filter operation
> write_imagef(dstImage, pos, pixel);
> }
>
> The above is terrible, because:
> 1) performs a first copy when I invoke PIL.Image.tostring()
> 2) performs a second copy when I invoke numpy.fromstring()
> 3) performs a third copy with Image.frombytes()
>
> How do I read/write directly from/to a PIL.Image to/from the GPU memory?
What error do you get if you pass the (PIL) Image as the hostbuf, with
USE_HOST_PTR?
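If PyOpenCL rejects the raw PIL Image with a buffer-interface error (as in the TypeError thread above), a middle ground that usually works is to materialize the pixels once into a writable NumPy array and pass that as the hostbuf; a hedged sketch (the helper name is ours):

```python
import numpy as np

def image_to_host_array(img):
    """Return a C-contiguous, writable pixel array suitable as a
    USE_HOST_PTR hostbuf (one copy out of PIL, none after that)."""
    a = np.asarray(img)  # accepts PIL images and other array-likes
    if not a.flags.writeable or not a.flags.c_contiguous:
        # np.asarray can yield a read-only view; force an owned copy.
        a = np.ascontiguousarray(a).copy()
    return a
```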
Andreas
Hi,
how do I write a kernel that reads from a PIL.Image and writes to
another, without using any unnecessary malloc and memcpy operations?
So far I could only implement it like this:
SHAPE = (16,16)
def some_filter(src):
fmt = cl.ImageFormat(cl.channel_order.RGBA,
cl.channel_type.UNORM_INT8)
src_np = numpy.fromstring(src.tostring(),
dtype=numpy.uint32).reshape(src.size)
dst_np = numpy.empty(src.size, dtype=numpy.uint32)
src_buf = cl.Image(self.ctx, flags=mf.READ_ONLY, format=fmt,
shape=src.size)
dst_buf = cl.Image(self.ctx, flags=mf.WRITE_ONLY, format=fmt,
shape=src.size)
cl.enqueue_copy(self.queue, src_buf, src_np, origin=(0,0),
region=src.size)
self.prg.somefilter(self.queue, src.size, SHAPE, src_buf, dst_buf)
cl.enqueue_copy(self.queue, dst_np, dst_buf, origin=(0,0),
region=src.size)
dst = Image.frombytes('RGBA', src.size, dst_np)
return dst
Kernel code:
__constant sampler_t SAMPLER = CLK_NORMALIZED_COORDS_FALSE |
CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
kernel void somefilter(
__read_only image2d_t srcImage,
__write_only image2d_t dstImage
) {
const int2 pos = {get_global_id(0), get_global_id(1)};
float4 pixel = read_imagef(srcImage, SAMPLER, pos);
//omissis: actual filter operation
write_imagef(dstImage, pos, pixel);
}
The above is terrible, because:
1) performs a first copy when I invoke PIL.Image.tostring()
2) performs a second copy when I invoke numpy.fromstring()
3) performs a third copy with Image.frombytes()
How do I read/write directly from/to a PIL.Image to/from the GPU memory?
TIA
Hello.
I'll be in Berlin on 2014-05-15, attending AWS Summit.
If anyone wants to meet there, or in Berlin around that time,
I'll gladly do so.
Best regards.
--
Tomasz Rybak GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
Hello.
What's going on with PyOpenCL repositories?
I cannot clone one on tiker.net:
$ git clone http://git.tiker.net/trees/pyopencl.git
Cloning into 'pyopencl'...
error: unable to open object pack
directory: /tmp/pyopencl/.git/objects/pack: Too many open files
fatal: failed to read object bff987ecd9aca7f6b2e1dc8a86af7086ee9cbcc2:
Too many open files
At the same time, the repository on GitHub contains an old revision,
a2fb48462cf957590d2c9c6c8eb3776f781ca6b6 from
Wed Nov 27 14:40:35 2013 -0600, while the one from tiker.net
is at a1d1603041ff90f3814dd263c33ee35c7de33ef9 from
Tue Feb 18 12:01:10 2014 -0600.
A quick search suggests the need to re-push, or to run
git gc on the repository.
At the same time PyCUDA repositories seem OK.
--
Tomasz Rybak GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak