On 31.10.2017 at 03:37, Seth Thompson wrote:
I have been trying to debug the following test kernel, and I get an out of resources error
on my Nvidia GTX 560M (it's old!). When I switch to the Intel CPU, the out of resources
error goes away, but the code takes a very long time to return from the wait command. From
what I can see on the forums, there seem to be three possibilities:
1. A mis-referenced pointer (I tested it on a smaller problem and no errors)
2. An overrun on an array (I tested it on a smaller problem and it gives correct results)
3. A watchdog timer is being tripped (~5 sec) for Nvidia
I tested your code on an Nvidia GT 750M with 2 GB of memory on macOS without problems;
the total run time was about 1 s.
Could it be that the memory resources of your GPU are exhausted? (Often you are not
allowed to allocate all of the installed memory.)
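One way to check the memory hypothesis is to compare the planned buffer sizes against the limits the device reports. The following is only a rough sketch: `device_limits` and `check_buffers` are hypothetical helper names, and the buffer names and sizes passed in are whatever your own program allocates. It assumes the device object exposes `global_mem_size` and `max_mem_alloc_size`, as a `pyopencl.Device` does.

```python
def device_limits(device):
    """Read the two memory limits that matter for buffer allocation.

    `device` is expected to be a pyopencl.Device; any object exposing the
    same two attributes works as well (hypothetical helper, not PyOpenCL API).
    """
    return {
        "global_mem_size": device.global_mem_size,        # total device memory, bytes
        "max_mem_alloc_size": device.max_mem_alloc_size,  # largest single buffer, bytes
    }


def check_buffers(buffer_sizes, limits):
    """Return a list of human-readable problems (empty if none were found).

    `buffer_sizes` maps a buffer name (illustrative only) to its size in bytes.
    """
    problems = []
    # Each individual buffer must fit under the per-allocation limit.
    for name, nbytes in buffer_sizes.items():
        if nbytes > limits["max_mem_alloc_size"]:
            problems.append(f"{name}: {nbytes} bytes exceeds the per-buffer limit")
    # All buffers together must fit into device memory.
    total = sum(buffer_sizes.values())
    if total > limits["global_mem_size"]:
        problems.append(f"total: {total} bytes exceeds device memory")
    return problems
```

With PyOpenCL installed, something like `check_buffers({"a": a.nbytes, "b": b.nbytes}, device_limits(queue.device))` should work for NumPy arrays `a` and `b`. An empty result does not guarantee success, since the driver and the display may also occupy device memory, but a non-empty one would point towards the out of resources error being a memory problem.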
I do have an atomic add command in the kernel, which can cause a
slowdown. I just don't think it would be slow enough to trip a timer, but honestly I
don't know and need a second pair of eyes. This could also be tied to a fundamental
misconception on my part:
I am using one queue. If I understand it correctly, kernel calls are run in the order in
which the host adds them to the queue. I added a wait on the event of my kernel call; when
I remove the wait command from my kernel call, I get a dramatic speedup in kernel run time.
However, I cannot retrieve the data from the GPU if I remove the wait command (as if the
kernel is still running but the host has been returned control of program flow). This slow
data retrieval leads me to needing either a wait or finish command, but both of these are slow.
This brings me back around to the out of resources error. Where am I going
wrong or misunderstanding? Does control get returned to the host prior to the completion of
a kernel call? If it does not, why would it take a long time for the data to be returned to
the host in a .get() call? Do I need a wait/finish command?
Calling a kernel returns immediately; kernel execution takes place asynchronously
to the host control program. But transferring data with .get() waits for the completion of
the transfer (and of all preceding tasks in the queue), so it appears slow if you don't
wait for the kernel execution to finish first.
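To see where the time actually goes, the three phases can be timed separately. This is a minimal sketch, not taken from the attached files: `time_phases` and `demo` are hypothetical names, the `twice` kernel is a stand-in for the real one, and wall-clock timing is only a rough indicator. `launch` is assumed to be a zero-argument callable that enqueues the kernel and returns its event, and `result` an object with a blocking `.get()`, such as a `pyopencl.array.Array`.

```python
import time


def time_phases(launch, result):
    """Time enqueue, wait and device-to-host copy separately (sketch)."""
    t0 = time.perf_counter()
    event = launch()        # returns as soon as the kernel is queued
    t1 = time.perf_counter()
    event.wait()            # blocks until the kernel has actually finished
    t2 = time.perf_counter()
    host = result.get()     # blocking copy; only the transfer remains to be done
    t3 = time.perf_counter()
    return {"enqueue": t1 - t0, "wait": t2 - t1, "get": t3 - t2}, host


def demo():
    """Run the timing on a real device; needs numpy, pyopencl and an OpenCL driver."""
    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    # Stand-in kernel (assumption): doubles every element in place.
    prg = cl.Program(ctx, """
    __kernel void twice(__global float *a) {
        int i = get_global_id(0);
        a[i] = 2.0f * a[i];
    }
    """).build()
    a_dev = cl_array.to_device(queue, np.ones(1 << 20, dtype=np.float32))
    timings, _ = time_phases(
        lambda: prg.twice(queue, a_dev.shape, None, a_dev.data), a_dev)
    return timings
```

If the wait is removed, the time reported under "wait" should simply show up under "get" instead: the total stays roughly the same, and the kernel only looks faster because the host is no longer blocked while it runs.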
I have attached the Python function and kernel files. I apologize if
this has been answered elsewhere, but I have searched and come up empty; of course, I may be
searching for the wrong things.
Thanks & Regards
PyOpenCL mailing list