PS: forgot to say, if SIZE can't be evenly divided by the work group size,
you have to round the global size up to the next multiple of the work group
size, and you'll end up with a few extra threads at the end that you must
kill at the beginning of your kernel with

if (get_global_id(0) >= SIZE)
    return;

Otherwise you'll end up with a buffer overflow.
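For example, a host-side round-up plus that kernel guard could look like this
(sketch only; the kernel and buffer names are made up for illustration):

import numpy as np
import pyopencl as cl

SIZE = 10000019          # deliberately not a multiple of the group size
THREADS = 64
# Round the global size up to the next multiple of the work group size.
global_size = ((SIZE + THREADS - 1) // THREADS) * THREADS

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
    __kernel void fill(__global uint *out, uint size)
    {
        uint gid = get_global_id(0);
        if (gid >= size)      /* extra work-item from the round-up */
            return;
        out[gid] = gid;       /* stand-in for the real per-item work */
    }
""").build()

out = np.empty(SIZE, dtype=np.uint32)
out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, out.nbytes)
prg.fill(queue, (global_size,), (THREADS,), out_buf, np.uint32(SIZE))
cl.enqueue_copy(queue, out, out_buf).wait()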
On 27 Aug 2014 07:46, "CRV§ADER//KY" <crusaderky(a)gmail.com> wrote:
> You are running 1 kernel with SIZE total threads and THREADS work group
> size. The work group size is only about being able to share local memory
> and doing barrier synchronisation. If your simulation needs neither, then
> the work group size can be whatever gives you top performance (depends on
> the hardware).
>
> On 26 Aug 2014 23:54, "Joe Haywood" <haywoojr(a)mercyhealth.com> wrote:
>
>> When I launch a kernel using name(que, (SIZE,),(THREADS,),...) how does
>> this get interpreted? For example, if SIZE is 1e7 and THREADS is 64 does
>> opencl eventually launch 1e7 kernels with 64 threads per kernel? Or does it
>> launch 1e7/64 kernels? Or something entirely different?
>>
>> Ultimately, I am trying to play 1e7 games of one handed solitaire with
>> a 64 card deck. To get statistics on how many times x number of cards is
>> left. What I hope/want is the 1e7 games with 64 threads each. My answers
>> compare well to an openmp version but I'm afraid it might just be dumb luck.
>>
>> Any help is appreciated.
>>
Forgot to do reply to all
When I launch a kernel using name(que, (SIZE,),(THREADS,),...) how does this get interpreted? For example, if SIZE is 1e7 and THREADS is 64, does OpenCL eventually launch 1e7 kernels with 64 threads per kernel? Or does it launch 1e7/64 kernels? Or something entirely different?
Ultimately, I am trying to play 1e7 games of one-handed solitaire with a 64-card deck, to get statistics on how many times x cards are left. What I hope/want is the 1e7 games with 64 threads each. My answers compare well to an OpenMP version, but I'm afraid it might just be dumb luck.
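For concreteness, the call and kernel in my code are shaped roughly like this
(simplified sketch with placeholder names; the actual game logic is omitted):

import numpy as np
import pyopencl as cl

SIZE = 10000000     # total games = total work-items (global size)
THREADS = 64        # work group size

ctx = cl.create_some_context()
que = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
    __kernel void play_games(__global uint *cards_left)
    {
        uint game = get_global_id(0);   /* each work-item plays one game */
        /* ... shuffle and play the 64-card game for index game ... */
        cards_left[game] = 0;           /* placeholder result */
    }
""").build()

cards_left = np.empty(SIZE, dtype=np.uint32)
buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, cards_left.nbytes)
prg.play_games(que, (SIZE,), (THREADS,), buf)
cl.enqueue_copy(que, cards_left, buf).wait()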
Any help is appreciated.
Dear all,
we are trying to implement a k-nearest-neighbor search on GPUs with
PyOpenCL. The goal of the algorithm is: for a given target point,
find the k nearest points from a given set (the training data). The distance
between two points is the squared Euclidean distance.
One of our implementations is a brute-force approach, which aims
at processing big data sets in parallel, e.g. 1 million training points and
some millions of targets (test data). For every target point, one kernel
instance is created which finds the k nearest points among the
training points.
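In outline (and reduced to the single nearest neighbour to keep the sketch
short), the structure is roughly the following; the real kernel, its argument
list, and the sizes below are placeholders and differ from the attached code:

import numpy as np
import pyopencl as cl

n_train, n_test, dim = 10000, 10000, 3   # placeholder sizes

train = np.random.rand(n_train, dim).astype(np.float32)
test = np.random.rand(n_test, dim).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

train_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=train)
test_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=test)
nearest = np.empty(n_test, dtype=np.int32)
nearest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, nearest.nbytes)

prg = cl.Program(ctx, """
    __kernel void find_knn(__global const float *train,
                           __global const float *test,
                           __global int *nearest,
                           const int n_train, const int dim)
    {
        int t = get_global_id(0);          /* one work-item per test point */
        float best_d = INFINITY;
        int best_i = -1;
        for (int i = 0; i < n_train; i++) {
            float d = 0.0f;
            for (int j = 0; j < dim; j++) {
                float diff = train[i*dim + j] - test[t*dim + j];
                d += diff * diff;          /* squared Euclidean distance */
            }
            if (d < best_d) { best_d = d; best_i = i; }
        }
        nearest[t] = best_i;
    }
""").build()

# One work-item per test point; each one scans all training points.
prg.find_knn(queue, (n_test,), None,
             train_buf, test_buf, nearest_buf,
             np.int32(n_train), np.int32(dim))
cl.enqueue_copy(queue, nearest, nearest_buf).wait()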
Our problem is the following: everything works fine for small data sets,
and the results are as expected on both the GPU (GeForce GTX 650 with
NVIDIA driver 313.09) and the CPU (Intel Core i5-3450 with the AMD APP SDK),
running Ubuntu 12.10 and PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> pyopencl-2013.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py",
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a working minimal example, which consists of one
host Python file and one OpenCL kernel file.
We would highly appreciate any help - maybe we made a
mistake which is already known to you.
So the big question for us is: Why is it working on CPU and why isn't it
working on the GPU?
Are there nVidia-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem[0], 4 bytes cmem[3]
Or are there any rules for using a kernel for big data sets such as setting
the work group sizes or maximum memory usage?
The error message "invalid command queue" is confusing and I wasn't able
to find any helpful information (except that oftentimes "invalid command
queue" means segfault, but i could not find any wrong array adress yet.)
Maybe one of you could have a look at our code and finds some stupid
mistake.
We would be very grateful for every hint.
Best regards,
Justin Heinermann,
University Oldenburg
"CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
> OpenCL looks like something that marries fantastically with the whole
> greenlet single-threaded paradigm of Tornado. However I couldn't find
> enough support in PyOpenCL.
>
> Currently, the only way AFAIK to make PyOpenCL work with Tornado is:
>
>
> import concurrent.futures
> from tornado.gen import coroutine
> executor = concurrent.futures.ThreadPoolExecutor(NUM_THREADS)
>
> class CLAsyncHandler(RequestHandler):
> @coroutine
> def get(self):
> event1 = <PyOpenCL host-device copy>
> event2 = <PyOpenCL kernel, wait_for=[event1]>
> event3 = <PyOpenCL device-host copy, wait_for=[event2]>
> yield executor.submit(event3.wait)
> <omissis: do something with the returned host buffer>
>
>
> which is quite horrible from a performance point of view.
> From what I understood so far from the Tornado docs, all that would
> need to change in PyOpenCL to make this thread-less is to enrich
> pyopencl.Event with the same API as concurrent.futures.Future:
> https://docs.python.org/dev/library/concurrent.futures.html#future-objects
> The only non-trivial method is add_done_callback(), which would
> internally need to call clSetEventCallback, which I believe is not
> currently in PyOpenCL at all?
>
> The above code would become
>
> from tornado.gen import coroutine
>
> class CLAsyncHandler(RequestHandler):
> @coroutine
> def get(self):
> event1 = <PyOpenCL host-device copy>
> event2 = <PyOpenCL kernel, wait_for=[event1]>
> yield <PyOpenCL device-host copy, wait_for=[event2]>
> <omissis: do something with the returned host buffer>
>
>
> As a bonus but not fundamental point, it would be nice if the output
> of the result() method of a device-to-copy Event was the output host
> buffer.
Correct, clSetEventCallback is not currently wrapped. In the past,
handling asynchronous callbacks from undetermined thread contexts in
Python was a bit of a nightmare that I decided to not get into. Python
2.7 has actually made this a lot better by adding Py_AddPendingCall(),
so I'd be willing to consider this, and it wouldn't even be terribly
hard.
On the cffi end, I'm not completely sure that cffi's callback mechanisms
are safe when they're called at arbitrary, asynchronous times, but they
at least don't mention that they *aren't*.
Andreas
Hi,
I started designing a Tornado RPC server that, internally, would
use PyOpenCL.
OpenCL looks like something that marries fantastically with the whole
greenlet single-threaded paradigm of Tornado. However I couldn't find
enough support in PyOpenCL.
Currently, the only way AFAIK to make PyOpenCL work with Tornado is:
import concurrent.futures
from tornado.gen import coroutine

executor = concurrent.futures.ThreadPoolExecutor(NUM_THREADS)

class CLAsyncHandler(RequestHandler):
    @coroutine
    def get(self):
        event1 = <PyOpenCL host-device copy>
        event2 = <PyOpenCL kernel, wait_for=[event1]>
        event3 = <PyOpenCL device-host copy, wait_for=[event2]>
        yield executor.submit(event3.wait)
        <omissis: do something with the returned host buffer>
which is quite horrible from a performance point of view.
From what I understood so far from the Tornado docs, all that would
need to change in PyOpenCL to make this thread-less is to enrich
pyopencl.Event with the same API as concurrent.futures.Future:
https://docs.python.org/dev/library/concurrent.futures.html#future-objects
The only non-trivial method is add_done_callback(), which would
internally need to call clSetEventCallback, which I believe is not
currently in PyOpenCL at all?
The above code would become
from tornado.gen import coroutine

class CLAsyncHandler(RequestHandler):
    @coroutine
    def get(self):
        event1 = <PyOpenCL host-device copy>
        event2 = <PyOpenCL kernel, wait_for=[event1]>
        yield <PyOpenCL device-host copy, wait_for=[event2]>
        <omissis: do something with the returned host buffer>
As a bonus but not fundamental point, it would be nice if the output
of the result() method of a device-to-host copy Event were the output host
buffer.
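For concreteness, the thread-based equivalent of that Future-like behaviour
can be written as a small helper along these lines (sketch only; it still
needs one worker thread per outstanding event, which is exactly the overhead
I'd like to avoid):

import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(4)   # placeholder pool size

def event_to_future(event, result=None):
    """Return a concurrent.futures.Future that resolves when `event` does.

    `result` (e.g. the destination host array of a device-to-host copy)
    becomes the value returned by future.result().
    """
    def wait_and_return():
        event.wait()        # blocks a pool thread, not the Tornado IO loop
        return result
    return executor.submit(wait_and_return)

# Inside a @coroutine handler:
#     host_array = yield event_to_future(copy_event, host_array)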
Let me know your thoughts
Hi *,
is there a possibility to install python-pyopencl from apt with GL
enabled? (It is disabled by default.) I thought I saw something like this on a
message board somewhere, but am unable to find it again. It went
something like
CL_ENABLE_GL=true apt-get install python-pyopencl
or so. Background is that I want others to install pyopencl with
ENABLE_GL without having to go through git/build.
Kind regards,
Kai
Hi PyOpenCLers,
I ran into an unexpected problem when using the multiprocessing library
with pyopencl.
My intent was to create a multi-process application, where an initial
python process uses multiprocessing.Process to create a number of
sub-processes, of which just one uses pyopencl.
The unexpected problem I found was that if the initial process imports
pyopencl, that interferes with the subprocess using pyopencl, at the
point that the command queue is created. The error is "RuntimeError:
CommandQueue failed: out of host memory". Here's a test program that
illustrates the problem.
---------------------- Cut -------------------------------------------
import multiprocessing

# Comment out to avoid "out of host memory" error.
import pyopencl

def main():
    proc = multiprocessing.Process(target=cl_setup)
    proc.start()

def cl_setup():
    import pyopencl as cl

    platform = cl.get_platforms()[0]
    # Change GPU to CPU to avoid "out of host memory" error.
    devices = platform.get_devices(cl.device_type.GPU)
    ctx = cl.Context(devices)
    device = devices[0]
    cmd_q = cl.CommandQueue(ctx, device)
    print "Success. Your cmd_q is %s." % cmd_q

if __name__ == '__main__':
    main()
---------------------- Cut -------------------------------------------
Note that the root process just imports pyopencl. It never references
it, but something about the import causes a problem for the sub-process
that actually wants to do pyopencl things.
I can work around this problem by not importing pyopencl in the root
process, but I thought it was unexpected enough that I should report it.
Interestingly, this problem does not occur if I use cl.device_type.CPU
instead of cl.device_type.GPU.
I looked through pyopencl/__init__.py but didn't see anything obvious
that would cause this behavior.
My environment is Ubuntu 14.04, AMD APP 1445.5, Python 2.7.6, and the
latest from pyopencl's master branch (commit 2382347).
Finally, here's an actual traceback of the error:
[orion]brix:/brix/src/orion/src$ python cl_mp.py
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "cl_mp.py", line 19, in cl_setup
    cmd_q = cl.CommandQueue(ctx, device)
RuntimeError: CommandQueue failed: out of host memory
Mark