The important bits that seem to trigger the behaviour are:
import pyopencl as cl
import numpy as np
mf = cl.mem_flags  # needed for mf.READ_ONLY below
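platform = cl.get_platforms()[0]  # get_platforms() returns a list; index 0 is assumed
devs = platform.get_devices()
device1 = devs[0]  # index assumed; the original index is not shown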
ctx = cl.Context([device1])
queue = cl.CommandQueue(ctx)
queue2 = cl.CommandQueue(ctx)
with open('Kernel.cl', 'r') as f:
    fstr = f.read()
prg = cl.Program(ctx, fstr).build()
d_image = cl.Image(ctx, mf.READ_ONLY,
wev1 = cl.enqueue_copy(queue, d_image, h_data, is_blocking=False,
prg.sum(queue2, (h_image_shape,), None, d_image, wait_for=[wev1])
The kernel does some simple number crunching on the input image.
I'm measuring with nvvp; one result is attached, where you can clearly see
that the kernel launches long before the copy has finished.
I already tested with OoO disabled; same behaviour.
Implementation is 'Tesla K10.G2.8GB' on 'NVIDIA CUDA'.
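
As a cross-check that does not depend on the nvvp timeline, the ordering can also be read from OpenCL's own event profiling. The following is a minimal sketch, not the code from this thread: it assumes a plain Buffer instead of the Image, a trivial kernel called `scale`, and the hypothetical helper names `measure_ordering` and `copy_finished_before_kernel`. It also assumes that profiling timestamps from two queues on the same device are comparable, which is usual but worth confirming for a given implementation.

```python
def copy_finished_before_kernel(copy_end_ns, kernel_start_ns):
    """Return True if the kernel started at or after the end of the copy."""
    return kernel_start_ns >= copy_end_ns


def measure_ordering(n=1 << 20):
    """Enqueue a non-blocking copy on one queue and a dependent kernel on another.

    Returns (copy_end_ns, kernel_start_ns) taken from OpenCL event profiling.
    """
    # Imported here so the pure helper above can be used without OpenCL installed.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context(interactive=False)
    prof = cl.command_queue_properties.PROFILING_ENABLE
    io_queue = cl.CommandQueue(ctx, properties=prof)        # illustrative name
    compute_queue = cl.CommandQueue(ctx, properties=prof)   # illustrative name

    h_data = np.arange(n, dtype=np.float32)
    # Assumption: a Buffer stands in for the Image used in the original code.
    d_data = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, h_data.nbytes)

    prg = cl.Program(ctx, """
        __kernel void scale(__global float *a) {
            int i = get_global_id(0);
            a[i] = 2.0f * a[i];
        }
    """).build()

    # Non-blocking host-to-device copy; the returned event is what the kernel waits on.
    copy_ev = cl.enqueue_copy(io_queue, d_data, h_data, is_blocking=False)
    # Flush the copy queue so the command is submitted before another queue waits on it.
    io_queue.flush()

    kernel_ev = prg.scale(compute_queue, (n,), None, d_data, wait_for=[copy_ev])
    kernel_ev.wait()

    return copy_ev.profile.end, kernel_ev.profile.start
```

On a machine with an OpenCL device, calling `measure_ordering()` and passing the two returned timestamps to `copy_finished_before_kernel` reports whether the runtime's own timestamps agree with what the profiler shows.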
On 2016-04-05 15:35, Andreas Kloeckner wrote:
I am not quite getting the function of the event system together with
I want to enqueue a non-blocking copy that returns an event
for a kernel to wait on:
wev1 = cl.enqueue_copy(IOqueue, device_image, host_image,
is_blocking=False, origin=(0,0,0), region=host_image_shape)
Both queues are defined as OoO queues in the same context on the same
In my profiling I can see the kernel start before the copy has
finished. Does that mean I have to use blocking copies,
or am I doing something else wrong?
How are you measuring? What implementation is this on? (Only Intel CPU
supports OoO queues as of now, as far as I know.) Can you show code to
reproduce? FWIW, your code snippet looks correct to me, in the sense
that the kernel should see all results of the copy.