James Bergstra <james.bergstra(a)gmail.com> writes:
Hi, I have written an OpenCL program that involves relatively small
kernels. For one benchmarking script, the time spent in the kernels adds
up to 0.06 seconds, while the tightest Python loop I can think of still
requires 0.2 seconds to execute the 5000-or-so kernel calls. The
program involves repeatedly looping through the same kernels, with the same
arguments, so I was wondering if there was a way to enqueue several
ND-range kernels at once, at least from Python's perspective. Is there such a
In other words, supposing I have kernels A and B, taking arguments x and y,
my program consists of:
A(x); B(y); A(x); B(y); ....
Ideally, I would like to enqueue 100 copies of the kernel sequence [(A, x),
(B, y)], but being able to enqueue even [(A, x), (B, y)] with one call
instead of 2 could be a big help.
What you're saying is that Kernel.__call__ is too slow for your current
First off, it'd be great if you could take a look at Kernel.set_args:
to see if there's any fat that could be trimmed from your
perspective. I've tried to keep this code path as quick as I could, but
there might be something I've overlooked.
Next, if there's nothing to be had in that direction, you can simply
call Kernel.set_args once and then repeatedly call
cl.enqueue_nd_range_kernel() as done in Kernel.__call__ (see source link
above). Since your arguments stay the same from one iteration to the next,
that skips the per-call argument handling and should get reasonably close
to the rate that the OpenCL API itself can sustain.
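
To make the second suggestion concrete, here is a rough sketch of what such
a loop could look like. The helper name enqueue_repeated, its `enqueue`
parameter, and the (kernel, global_size, local_size) tuple layout are made up
for illustration and are not part of PyOpenCL. It assumes A and B are
separate Kernel objects whose arguments were already bound once with
set_args, and that your global/local sizes do not change between iterations.

```python
def enqueue_repeated(queue, steps, repeats, enqueue=None):
    """Enqueue every (kernel, global_size, local_size) step in order, `repeats` times.

    Assumption: kernel.set_args(...) has already been called once on each
    kernel, so no argument handling happens inside the loop.
    Returns the event of the last enqueued kernel, or None if nothing ran.
    """
    if enqueue is None:
        # Imported lazily so the helper can be tried out without an OpenCL device.
        import pyopencl as cl
        enqueue = cl.enqueue_nd_range_kernel

    event = None
    for _ in range(repeats):
        for kernel, global_size, local_size in steps:
            event = enqueue(queue, kernel, global_size, local_size)
    return event


# Hypothetical usage, with A, B, x_buf, y_buf, n and queue coming from your program:
#   A.set_args(x_buf)
#   B.set_args(y_buf)
#   last = enqueue_repeated(queue, [(A, (n,), None), (B, (n,), None)], repeats=100)
#   last.wait()
```

This still issues one Python-level call per kernel launch, so it only trims
the per-call overhead rather than batching the launches into a single call.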
Hope that helps,