I really like most of the design decisions made in the PyOpenCL wrapper (http://documen.tician.de/pyopencl/misc.html#relation-with-opencl-s-c-bindin…). I'm trying to understand the reasoning behind the last point, though:
"If an operation involves two or more “complex” objects (like e.g. a kernel enqueue involves a kernel and a queue), refuse the temptation to guess which one should get a method for the operation. Instead, simply leave that command to be a function."
The reason I ask is that this appears to be the crucial difference between the PyOpenCL bindings and the C++ bindings available on the Khronos website. It doesn't appear that either API has really encapsulated the complex relationship between kernels and queues. To be specific:
- The C++ bindings rely on cl::Kernel::bind to explicitly bind a kernel to a command queue, returning a KernelFunctor object to represent this relationship. We've already seen that this confuses people, since it leads them to bind the same kernel to multiple command queues.
- PyOpenCL tries to avoid this by making Kernel.__call__ a one-stop shop for enqueueing kernels. The result is a long argument list that is difficult to remember. My experience has been that long argument lists are confusing and error-prone. Furthermore, this design does not transfer readily to clEnqueueTask, since the user is always required to specify a global dimension.
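To make the contrast concrete, here is a sketch of the two call styles in PyOpenCL. It assumes an already-built program `prg` containing a kernel `scale(__global float *a, float factor)`, a valid `queue`, and a device buffer `buf`; those names are illustrative, not from the original posts.

```python
def enqueue_both_ways(queue, prg, buf, n):
    """Sketch: the same kernel launch expressed in PyOpenCL's two styles.

    Requires pyopencl and an OpenCL device; all names here are
    illustrative assumptions, not part of the original discussion.
    """
    import numpy as np
    import pyopencl as cl

    # Style 1: Kernel.__call__ -- queue, global size, local size, and then
    # the kernel's own arguments, all in one long positional list.
    evt1 = prg.scale(queue, (n,), None, buf, np.float32(2.0))

    # Style 2: the plain-function form, closer to the C API: set the kernel
    # arguments first, then hand both kernel and queue to
    # enqueue_nd_range_kernel, so neither object "owns" the operation.
    knl = prg.scale
    knl.set_args(buf, np.float32(2.0))
    evt2 = cl.enqueue_nd_range_kernel(queue, knl, (n,), None)
    return evt1, evt2
```

The second style is the one the quoted design rule argues for: when an operation involves two complex objects (kernel and queue), it stays a function rather than becoming a method of either.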
I don't mean to start a flame war, just a fruitful discussion that will hopefully benefit both APIs. The rumor is the C++ bindings are being considered for inclusion in the OpenCL 1.1 specification, so we have one chance to get them correct.
I have a PC with two R4870 video cards running Linux. Using PyOpenCL,
I can run a program on either GPU, but when I try to run two kernels
simultaneously (one on each card), it seems that in order for the
second queued kernel to run, the first one queued must finish.
I expect to be able to queue two instances of the same kernel, one on
each GPU, with a total running time roughly the same as running a
single instance, but this is not happening. I have tried calling
queue.flush() after queuing each kernel, but the running time is
still the same.
Is it possible to use both (multiple) GPUS simultaneously in PyOpenCL?
How can this be done?
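One common pattern is a separate context and queue per device, enqueueing on both queues before blocking on either. The sketch below assumes two GPU devices on the first platform and uses a made-up `scale` kernel; it is an illustration of the structure, not a tested fix for the timing issue above.

```python
# Illustrative kernel: doubles each element of a float buffer.
KERNEL_SRC = """
__kernel void scale(__global float *a, float factor)
{
    int gid = get_global_id(0);
    a[gid] = a[gid] * factor;
}
"""

def run_on_two_gpus(host_arrays):
    """Sketch: run the same kernel concurrently on two GPUs.

    Requires pyopencl, an OpenCL runtime, and at least two GPU devices
    on the first platform; `host_arrays` is a pair of float32 ndarrays.
    """
    import numpy as np
    import pyopencl as cl

    platform = cl.get_platforms()[0]
    gpus = platform.get_devices(device_type=cl.device_type.GPU)[:2]

    queues, events, bufs = [], [], []
    for dev, a in zip(gpus, host_arrays):
        # A separate context per device keeps the queues fully independent.
        ctx = cl.Context([dev])
        queue = cl.CommandQueue(ctx)
        prg = cl.Program(ctx, KERNEL_SRC).build()
        buf = cl.Buffer(ctx,
                        cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                        hostbuf=a)
        # Enqueue on both queues *before* waiting on either, then flush so
        # the implementation actually submits the work to each device.
        evt = prg.scale(queue, a.shape, None, buf, np.float32(2.0))
        queue.flush()
        queues.append(queue)
        events.append(evt)
        bufs.append(buf)

    # Only now block; by this point both kernels should be in flight.
    for queue, evt, buf, a in zip(queues, events, bufs, host_arrays):
        cl.enqueue_copy(queue, a, buf, wait_for=[evt])
        queue.finish()
```

The key point is ordering on the host side: any blocking call (`finish`, `wait`, a blocking copy) issued between the two enqueues serializes the devices.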
I've just gotten my hands on a GTX 260, so I'm experimenting with
PyOpenCL again. I've picked up a previous example that uses a
red-black 4-point Gauss-Seidel solver (a parallelization of
http://www.scipy.org/PerformancePython). I've noticed that we don't
seem to have access to clEnqueueBarrier, and I'm wondering what
might be the most efficient way to do something similar, or whether
adding clEnqueueBarrier might be a good idea. Using
enqueue_wait_for_events and waiting on a single previous event seems
to be faster than .wait() (presumably because it doesn't have to
switch back and forth between the interpreter and the OpenCL
implementation as much); is this the recommended method for the