"Schock, Jonathan" <jonathan.schock(a)tum.de> writes:
The important bits that seem to trigger the behaviour are:
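import pyopencl as cl
import numpy as np
mf = cl.mem_flags
platform = cl.get_platforms()[0]  # index assumed; get_platforms() returns a list
devs = platform.get_devices()
device1 = devs[0]  # index assumed; get_devices() returns a list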
ctx = cl.Context([device1])
queue = cl.CommandQueue(ctx)
queue2 = cl.CommandQueue(ctx)
f = open('Kernel.cl', 'r')
fstr = "".join(f.readlines())
prg = cl.Program(ctx, fstr).build()
d_image = cl.Image(ctx, mf.READ_ONLY,
wev1 = cl.enqueue_copy(queue, d_image, h_data, is_blocking=False,
prg.sum(queue2, (h_image_shape,), None, d_image, wait_for=[wev1])
The kernel does some simple number crunching on the input image.
I'm measuring with nvvp; one result is attached, where you can clearly see
that the kernel launches long before the copy has ended.
I already tested with OoO disabled: same behaviour.
Implementation is 'Tesla K10.G2.8GB' on 'NVIDIA CUDA'.
For completeness: What driver version is this?
I am pretty certain that what you're seeing is non-compliant behavior by
the Nv implementation. The behavior you're seeing is consistent with Nv
just ignoring the event_wait_list.
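To cross-check the nvvp trace from the host side, you could compare the profiling timestamps of the two events. The following is only a rough sketch: the helper name dependency_violated is made up for illustration, and the events here are stand-in objects so the snippet runs without a GPU. With real PyOpenCL events it assumes both queues were created with properties=cl.command_queue_properties.PROFILING_ENABLE and that both events have completed (e.g. after calling .wait() on them) before .profile is read.

```python
from types import SimpleNamespace


def dependency_violated(copy_ev, kernel_ev):
    """Return True if the kernel began executing before the copy finished.

    Both arguments only need .profile.start and .profile.end (device
    timestamps in nanoseconds), which is what pyopencl events expose
    when profiling is enabled on their queue.
    """
    return kernel_ev.profile.start < copy_ev.profile.end


# Stand-in events with made-up timestamps (hypothetical values, not measurements).
copy_ev = SimpleNamespace(profile=SimpleNamespace(start=1_000, end=9_000))
kernel_ev = SimpleNamespace(profile=SimpleNamespace(start=4_000, end=12_000))

# The kernel starts at 4000 ns, before the copy ends at 9000 ns.
print(dependency_violated(copy_ev, kernel_ev))
```

In the script above, this would be called as dependency_violated(wev1, kev), where kev is the event returned by prg.sum(...). A True result would agree with what the nvvp timeline shows.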
Beyond that, using two queues *and* ooo is redundant. ooo alone will already allow
I would probably recommend using a single queue until overlapping