Martin Rehr <rehr(a)nbi.ku.dk> writes:
> The 'elwise_kernel_runner' in 'pyopencl.array' seems rather
> inefficient compared to the way 'pycuda.gpuarray' uses prepared
> kernel invocations. Are there any plans to change
> 'pyopencl.array.elwise_kernel_runner' to use
> 'enqueue_nd_range_kernel' and 'set_arg', the way 'pycuda.gpuarray'
> uses 'prepared_async_call'?
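For readers unfamiliar with the "prepared call" pattern the question refers to: the idea is to resolve each argument's handling once, up front, so that the per-launch path does no type inspection. The sketch below illustrates this in plain Python with made-up names (PreparedKernel, prepared_call); it is not pyopencl or pycuda API, just the shape of the technique.

```python
import struct

class PreparedKernel:
    # Hypothetical illustration: struct packers stand in for the
    # per-argument marshalling a real kernel wrapper would do.
    _PACKERS = {
        "int32": struct.Struct("i").pack,
        "float32": struct.Struct("f").pack,
    }

    def __init__(self):
        self._packers = None

    def prepare(self, arg_dtypes):
        # Done once: look up one packer per argument.
        self._packers = [self._PACKERS[dt] for dt in arg_dtypes]
        return self

    def prepared_call(self, *args):
        # Done per launch: no isinstance dispatch, just apply the
        # pre-resolved packers (a stand-in for repeated set_arg calls).
        return [pack(a) for pack, a in zip(self._packers, args)]

knl = PreparedKernel().prepare(["int32", "float32"])
packed = knl.prepared_call(3, 2.5)
```

The win is that all the "which type is this?" work moves out of the hot loop and into a one-time prepare step.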
I can see a few ways in which this could be made more
efficient. One big one is to cache the grid/work group sizes that are
computed by pyopencl.array.splay. Another would be to rename
Kernel.set_scalar_arg_dtypes to Kernel.prepare and to have it replace
Kernel.__call__ with some generated Python code. Neither of these is
difficult. I've recently written a helper for Python code generation
that could be useful in that effort:
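A minimal sketch of both ideas, under stated assumptions: the `splay` below is a stand-in for pyopencl.array.splay (the real signature differs), and `make_caller` is a hypothetical illustration of what a prepare() step that generates specialized Python code might look like. Neither is existing pyopencl API.

```python
import functools

# Idea 1: memoize the grid/work-group size computation, so repeated
# launches with the same element count skip the arithmetic entirely.
@functools.lru_cache(maxsize=None)
def splay(n, max_group_size=256):
    group_size = min(n, max_group_size)
    num_groups = -(-n // group_size)  # ceiling division
    return (num_groups * group_size,), (group_size,)

# Idea 2: a prepare()-style step could exec() a specialized call body,
# so the hot path is straight-line code with no per-argument branching.
def make_caller(arg_names):
    src = "def call(set_arg, {0}):\n".format(", ".join(arg_names))
    for i, name in enumerate(arg_names):
        src += "    set_arg({0}, {1})\n".format(i, name)
    namespace = {}
    exec(src, namespace)
    return namespace["call"]

call = make_caller(["dest", "src_ary", "n"])
```

The generated function simply issues one set_arg per argument in order, which is the kind of code one would otherwise write by hand for each kernel signature.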
I'm a bit swamped right now, but I am happy to review patches in this
direction, and I'll look into addressing this as I find the time.
In addition, thanks to Marko Bencun, there is a branch of pyopencl that
works on cffi (and thereby targets PyPy), which also aims to shrink the
Python-side invocation overhead.