On Mar 15, 2010, at 8:38 AM, Andreas Klöckner wrote:
On Sunday, 14 March 2010, Brian Cole wrote:
I really like most of the design decisions made in the PyOpenCL wrapper
ings). I'm trying to understand the reasoning of the last point though:
"If an operation involves two or more “complex” objects (like e.g. a kernel
enqueue involves a kernel and a queue), refuse the temptation to guess
which one should get a method for the operation. Instead, simply leave
that command to be a function."
The reason I ask is that this appears to be the crucial difference between the
PyOpenCL bindings and the C++ bindings available on the Khronos website.
It doesn't appear either API has really encapsulated the complex
relationship between kernels and queues. To be specific:
- The C++ bindings rely on cl::Kernel::bind to explicitly bind a kernel to
a command queue, returning a KernelFunctor object to represent this
relationship. We've already seen that this confuses people, since it
leads them to bind the same kernel to multiple command queues.
- PyOpenCL tries to avoid this by making Kernel.__call__ a one-stop shop
for enqueuing kernels. The result is a long argument list that is difficult
to remember. My experience has been that long argument lists are confusing
and error-prone. Furthermore, this design is not readily transferable to
clEnqueueTask since the user is always required to specify a global
Hmm. As a middle ground between the two, one could add a "KernelBinding"
or "BoundKernel" object that encapsulates queue (and perhaps local_size)
info and thereby shortens the arg list by two entries.
I agree, the statefulness is exactly what is bothering me. But unfortunately, kernel
objects are by their very nature stateful since they hold arguments. Even in PyOpenCL the
same kernel object should not be launched from multiple threads.
This is where PyOpenCL shines: it's really easy to get a new kernel for each thread by
using Program.kernel_name. Using the RAII principle, what if something similar were
done to return a "BoundKernel" object?
Program.get_bound_kernel(name, queue, global=None, local=None)
The BoundKernel object would then have getters and setters for global and local size in
case they need to be tweaked later instead of at construction (if they are left as None,
it is a task). Introducing another class does break the Zen of Python, though: "There
should be one-- and preferably only one --obvious way to do it."
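To make the BoundKernel idea concrete, here is a minimal sketch. Nothing in it is existing PyOpenCL API: the class and the fake_kernel stand-in are made up for illustration, and the sketch assumes the kernel is callable in the form used elsewhere in this thread, knl(queue, global_size, *args, local_size=...).

```python
class BoundKernel:
    """Hypothetical helper: a kernel paired with a queue and launch sizes."""

    def __init__(self, kernel, queue, global_size=None, local_size=None):
        self.kernel = kernel
        self.queue = queue
        # Plain attributes stand in for the getters and setters.
        self.global_size = global_size
        self.local_size = local_size

    def __call__(self, *args):
        if self.global_size is None:
            # No global size given: launch as a single work-item task.
            return self.kernel(self.queue, (1,), *args, local_size=(1,))
        return self.kernel(self.queue, self.global_size, *args,
                           local_size=self.local_size)


def fake_kernel(queue, global_size, *args, local_size=None):
    """Stand-in for a real kernel object; reports what it was called with."""
    return (queue, global_size, args, local_size)


bound = BoundKernel(fake_kernel, "queue0", global_size=(1024,))
launch = bound(1, 2)       # sizes come from the bound object
bound.global_size = None   # tweak later: fall back to a task
task = bound(3)
```

Each thread would hold its own BoundKernel, so the state stays per-thread rather than being shared through the kernel object.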
It is interesting to note the difference between the two APIs about where the global and
local sizes should be specified. The C++ API is designed to make it easy to reuse the same
global and local size over and over again. PyOpenCL, by contrast, allows the user to redefine
the global size at every invocation, though it makes the local size a second-class citizen
that should be ignored for the most part.
On the other hand, I'm not sure the Kernel.__call__ signature is
excessively long. There are four things that the user needs to remember
to put there--queue, global_size, his arguments, and local_size. I don't
like stateful interfaces (such as .bind()) for their multi-threading
implications, and OpenCL (as opposed to CUDA) seems to avoid them--so
I'd rather not reintroduce them in a wrapper. Also I find that Python's
keyword arguments help with long arg lists--though not really with
My production kernel (will be sending a beta release of it to a customer soon) has 12
arguments. Unfortunately, kernels which do fairly complex operations have fairly complex
argument lists. Anything I can do to cut this number down will help with maintainability
in the long run. Though I'm curious whether anyone has tried abstracting arguments
into a struct? Would this sort of thing work? Even cooler, could the struct be
automatically converted from a Python keyword argument list?
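As a rough sketch of what the keyword-to-struct conversion could look like on the host side, using only the standard struct module: the field names and layout below are invented for illustration, and whether the packed bytes match the device-side struct (alignment, padding, endianness) is an assumption that would have to be checked per platform. How the bytes are then handed to the kernel is left open.

```python
import struct

# Hypothetical layout: (field name, struct format code) in declaration order.
PARAMS_LAYOUT = [("n_atoms", "i"), ("cutoff", "f"), ("scale", "f")]


def pack_params(layout, **kwargs):
    """Pack keyword arguments into bytes following the given field order."""
    names = [name for name, _ in layout]
    missing = set(names) - set(kwargs)
    extra = set(kwargs) - set(names)
    if missing or extra:
        raise TypeError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    # "<" selects little-endian with standard sizes and no padding; a real
    # device struct may need explicit padding fields to line up.
    fmt = "<" + "".join(code for _, code in layout)
    return struct.pack(fmt, *(kwargs[name] for name in names))


# Keyword order does not matter; the layout decides the byte order of fields.
blob = pack_params(PARAMS_LAYOUT, cutoff=2.5, n_atoms=128, scale=1.0)
```

With something like this, a 12-argument kernel could take one struct argument plus its buffers, and a misspelled or forgotten parameter fails on the host before anything is enqueued.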
Regarding clEnqueueTask, because of
clEnqueueTask is equivalent to calling clEnqueueNDRangeKernel with
work_dim = 1, global_work_offset = NULL, global_work_size set to 1
and local_work_size set to 1.
it is safe to spell it
knl(queue, (1,), *args, local_size=(1,))
especially since task parallelism doesn't really seem to be such a
prominent feature of CL.
Fair enough, Intel is the only major company to _not_ release an OpenCL implementation.
But why not shift the location of the global size parameter?
__call__(queue, *args, global_size=None, global_offset=None, local_size=None,
This places like arguments together and allows global_size to be None indicating a task
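The reordered signature can be tried out today as a thin adapter, without touching the wrapper itself. This is only a sketch: call_kernel and recording_kernel are hypothetical names, global_offset is left out for brevity, and the task case reuses the (1,)/(1,) spelling from the clEnqueueTask equivalence quoted above.

```python
def call_kernel(knl, queue, *args, global_size=None, local_size=None):
    """Hypothetical adapter: kernel arguments first, sizes as keywords."""
    if global_size is None:
        # No global size means a task: one work-item, one work-group.
        return knl(queue, (1,), *args, local_size=(1,))
    return knl(queue, global_size, *args, local_size=local_size)


def recording_kernel(queue, global_size, *args, local_size=None):
    """Stand-in for a real kernel; returns the launch parameters it saw."""
    return {"queue": queue, "global": global_size,
            "args": args, "local": local_size}


as_task = call_kernel(recording_kernel, "q", "a", "b")
as_range = call_kernel(recording_kernel, "q", "a",
                       global_size=(64,), local_size=(8,))
```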
In summary, I'm not sure I know the right answer to your inquiry, but at
the same time I'm not spectacularly unhappy with the present state of
There has to be a clever strategy out there somewhere, or perhaps OpenCL is just too
bloody flexible to be wrangled into a single OO API that works for everyone. That's why
I started the thread: to hear other opinions. :-)
Differing opinions, anyone?