I found the solution to my problem. As soon as I assign a new variable name (e.g. I copy
into two more images with different names), these copies run in parallel with kernel
execution. So it has nothing to do with the images being images; it seems to be a
variable-reference (lifetime) issue inside Python (and therefore probably in pyopencl).
I haven't tried the equivalent with the C bindings, but I assume the same problem
would not occur there.
I have not yet tried to leverage multiple copy engines on one card, but this should also
be possible with the same approach.
So, pseudocode for the non-working and working examples:

    for i in range(10):
        image1 = cl.Image(...)

does NOT work in parallel, while

    for i in range(10):
        image1 = cl.Image(...)
        image2 = cl.Image(...)

DOES work in parallel. This seems to me to be a feature rather than a bug, because it
could lead to unexpected behaviour if the copy in the first example were to run during
the execution of a kernel accessing the exact same image.
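To make the variable-reference point concrete, here is a minimal pure-Python sketch of the mechanism I suspect (FakeImage and its tag are made up for illustration; this only demonstrates CPython reference counting, not pyopencl itself):

```python
released = []

class FakeImage:
    """Stand-in for cl.Image; __del__ models pyopencl releasing the
    underlying OpenCL object once the last Python reference is gone."""
    def __init__(self, tag):
        self.tag = tag

    def __del__(self):
        released.append(self.tag)

# Rebinding the same name each iteration drops the last reference to the
# previous image immediately -- the point where an implicit release (and
# with it a synchronisation) can sneak in while a kernel still uses it.
for i in range(3):
    image1 = FakeImage(i)

print(released)  # -> [0, 1]: each rebind frees the previous iteration's
                 # object; only the last one stays alive after the loop
```

With two alternating names, each image survives one extra iteration, which is enough for the copy into one image to overlap with a kernel still reading the other.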
Msc. Jonathan Schock
Chair for Biomedical Physics
Technische Universität München
Boltzmannstr. 11, 85748 Garching
Phone: +49 89 289 10846
From: Andreas Kloeckner <lists(a)informa.tiker.net>
Sent: Wednesday, 25 October 2017, 22:58:25
To: Schock, Jonathan; pyopencl(a)tiker.net
Subject: Re: [PyOpenCL] Overlapped Copy and execution of kernels on Nvidia Devices
"Schock, Jonathan" <jonathan.schock(a)tum.de> writes:
I again have a problem with my Nvidia graphics cards and
pyopencl. I wrote a simple kernel that computes the (pixelwise) median
of an image and writes the result to another image. I also wrote a simple event
visualiser to get an idea of execution times.
I have a simple setup: one context on one device, with two queues.
I enqueue my copies on one queue and my kernel executions on the other, where
the copies are associated with an event for the kernel to wait on. What I expect is
that when I repeat this process several times, the copies should execute in parallel
with the kernel execution on the other queue. What I see is that during kernel
execution, there is no parallel work on the other queue.
Is this a problem with my code or with Nvidia's implementation?
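In pyopencl-flavoured pseudocode, the setup described above is roughly (names like ctx, host_frame, shape and the median kernel are illustrative, not the actual code):

    copy_queue = cl.CommandQueue(ctx)
    exec_queue = cl.CommandQueue(ctx)

    for i in range(n):
        # enqueue the host-to-image copy on the first queue...
        copy_evt = cl.enqueue_copy(copy_queue, image_in, host_frame,
                                   origin=(0, 0), region=shape)
        # ...and make the kernel on the second queue wait on its event
        prg.median(exec_queue, shape, None, image_in, image_out,
                   wait_for=[copy_evt])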
Two issues I see with your code:
- I've only ever seen copies to and from buffers being overlapped, not
images. I might be wrong about that, though.
- In CUDA, async copies require page-locked host memory, which Nvidia
has mapped (somewhat nonsensically) to buffers created with
CL_MEM_ALLOC_HOST_PTR
(which you then need to map to host memory to get the actual CPU-side
pointer).
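A rough sketch of that pinned-memory pattern in pyopencl (flag names are from the OpenCL spec; nbytes, shape, dtype and src_data are illustrative):

    mf = cl.mem_flags
    # ALLOC_HOST_PTR asks the driver for page-locked (pinned) host memory
    pinned_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR,
                           size=nbytes)

    # map the buffer to get the actual CPU-side pointer as a numpy array
    host_view, map_evt = cl.enqueue_map_buffer(
        queue, pinned_buf, cl.map_flags.WRITE, 0, shape, dtype)
    host_view[...] = src_data   # fill the data through the mapped view
    # deleting/releasing the map hands the pinned data back to the device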