Thanks for the quick reply — that fixed it.
Just out of curiosity, is there a compelling reason not to cache the kernel code in the
program objects, and then quickly return it on repeated calls? I generally wouldn’t expect
calling a method repeatedly to be significantly slower than getting a copy and then calling
it. I guess what you’re saying is that I shouldn’t think of “prg.sum” as a method, but
rather as an argument-less function that returns a method? In that case, shouldn’t I
expect the syntax to be “sum_knl = prg.sum()”?
PS: Thanks for writing pyopencl — it has made my life much easier!
On Feb 13, 2016, at 6:42 PM, Andreas Kloeckner wrote:
Dustin Kleckner <dustinkleckner(a)gmail.com> writes:
> I’ve been using pyopencl for a while for various simulation/data processing tasks. I
> recently upgraded to a new computer, and noticed things were considerably slower.
> After some experimentation, I tracked this down to the version of pyopencl I was
> using. The updated version (2015.2.4; most recent on PyPI) takes significantly longer to
> queue a function call (~1.5 ms) than the old version (2015.1, ~0.03 ms). Both times come
> from the same machine*. Profiling indicates that the newer version is making lots of
> function calls the old version did not. FYI, the code I used to test this is below
> (adapted from documentation).
> For my purposes, this is slightly alarming: my code makes lots of kernel calls, in
> which case the new version is 50x slower for small data sets!
> Is this something that has been/will be fixed in newer versions of pyopencl? Is
> there a workaround? Of course, for the time being I can use the old version, but I’d
> rather not be stuck with it.
> If needed, I can provide the profiler output.
tl;dr: Hang on to the kernel object, i.e. 'sum_knl = prg.sum'. It's used
for caching stuff.
PyOpenCL 2015.2 generates custom Python code to make kernel invocation
*faster* (not slower). Generating this code (which gets attached to the
kernel object, prg.sum) takes time, and every time you call 'prg.sum',
you get a new kernel object. So you're likely mainly benchmarking the
generation (and compilation) of the invoker code.
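To illustrate the point, here is a minimal sketch (a toy stand-in, not pyopencl itself) of why 'prg.sum' hands back a fresh kernel object on every attribute access, and why holding on to it avoids repeating the expensive setup. The `Program` and `Kernel` classes below are hypothetical simplifications:

```python
class Kernel:
    """Stands in for a pyopencl kernel wrapper; constructing one
    represents the (expensive) generation of custom invoker code."""
    def __init__(self, name):
        self.name = name

class Program:
    """Mimics the pattern: attribute access builds a new kernel wrapper."""
    def __getattr__(self, name):
        # Runs on every 'prg.sum' lookup, so each lookup pays the
        # generation cost again and returns a brand-new object.
        return Kernel(name)

prg = Program()
assert prg.sum is not prg.sum  # each access creates a new wrapper

sum_knl = prg.sum              # fetch once, reuse many times:
for _ in range(1000):
    pass                       # call sum_knl(...) here in real code
```

With this pattern, fetching the kernel once outside the hot loop means the generated invoker code (and anything cached on the kernel object) is reused across calls instead of being rebuilt each time.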