Sorry to open this issue once again, but I found new information I
wanted to share:
The described behaviour only arises when the workgroup size is set to
I'm still not sure if this is reproducible.
On 2016-04-06 21:06, Schock, Jonathan wrote:
I am becoming more and more sure that the problem comes down to
misleading profiler output. The kernel seems to run in the correct
order, but the profiler is not able to show that because the runtimes
are too short. If I increase the runtime of the kernel (e.g. by
doubling the amount of data to crunch on), the anomaly seems to
vanish. I only run into the strange behaviour again when the kernel
runtime is below ~500 µs.
I'm nevertheless interested in the reason this is happening.
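The ordering question can be cross-checked without nvvp by reading the OpenCL profiling timestamps off the events themselves. The sketch below is not from the original mail: it assumes queues created with PROFILING_ENABLE, and the device-dependent part is kept inside a function because it needs pyopencl and a working OpenCL runtime. The helpers only compare raw nanosecond timestamps.

```python
def elapsed_us(start_ns, end_ns):
    # OpenCL profiling timestamps are in nanoseconds.
    return (end_ns - start_ns) / 1000.0

def kernel_started_early(copy_end_ns, kernel_start_ns):
    # True if the kernel's START timestamp precedes the copy's END
    # timestamp, i.e. the overlap the nvvp timeline appeared to show.
    return kernel_start_ns < copy_end_ns

def check_ordering():
    # Device-dependent part: requires pyopencl and an OpenCL device.
    import pyopencl as cl
    ctx = cl.create_some_context()
    props = cl.command_queue_properties.PROFILING_ENABLE
    queue = cl.CommandQueue(ctx, properties=props)
    queue2 = cl.CommandQueue(ctx, properties=props)
    # ... enqueue the copy (wev1) and the kernel (kev) as in the snippet
    # quoted further down, wait on both events, then compare:
    # return kernel_started_early(wev1.profile.end, kev.profile.start)
```

If kernel_started_early() still returns True with these timestamps, the overlap is real and not merely a display artifact of nvvp.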
On 2016-04-05 21:59, Andreas Kloeckner wrote:
> Hi Jonathan,
> "Schock, Jonathan" <jonathan.schock@tum.de> writes:
>> The important bits that seem to cause the behaviour are:
>> import pyopencl as cl
>> import numpy as np
>> platform = cl.get_platforms()[0]
>> devs = platform.get_devices()
>> device1 = devs[0]
>> h_data =
>> ctx = cl.Context([device1])
>> queue = cl.CommandQueue(ctx)
>> queue2 = cl.CommandQueue(ctx)
>> f = open('Kernel.cl', 'r')
>> fstr = "".join(f.readlines())
>> prg = cl.Program(ctx, fstr).build()
>> mf = cl.mem_flags
>> d_image = cl.Image(ctx, mf.READ_ONLY,
>> wev1 = cl.enqueue_copy(queue, d_image, h_data, is_blocking=False,
>> origin=(0,0,0), region=h_image_shape)
>> prg.sum(queue2,(h_image_shape,),None,d_image,wait_for = [wev1])
>> The Kernel is doing some simple number crunching on the input image.
>> I'm measuring with nvvp; one result is attached, where you can clearly
>> see that the kernel launches long before the copy has ended.
>> I already tested with OoO disabled... same behaviour.
>> Implementation is 'Tesla K10.G2.8GB' on 'NVIDIA CUDA'.
> For completeness: What driver version is this?
> I am pretty certain that what you're seeing is non-compliant behavior
> by the Nvidia implementation. The behavior you're seeing is consistent
> with it just ignoring the event_wait_list.
> Beyond that, using two queues *and* ooo is redundant. ooo alone will
> already allow concurrency.
> I would probably recommend using a single queue until overlapping
> becomes crucial.
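A minimal sketch of the single-queue variant suggested above (not from the original mails; the image format, channel type, and kernel signature are assumptions, since the archive elides them). With one in-order queue, the copy and the kernel are serialized by the queue itself and no wait_for list is needed:

```python
def needs_wait_list(num_queues, out_of_order):
    # An event wait list only matters when commands can actually run
    # concurrently: across distinct queues, or within an out-of-order queue.
    return num_queues > 1 or out_of_order

def run_single_queue(kernel_src, h_data, shape):
    # Device-dependent sketch: requires pyopencl and an OpenCL device.
    import pyopencl as cl
    mf = cl.mem_flags
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)  # single in-order queue
    prg = cl.Program(ctx, kernel_src).build()
    fmt = cl.ImageFormat(cl.channel_order.R, cl.channel_type.FLOAT)  # assumed
    d_image = cl.Image(ctx, mf.READ_ONLY, fmt, shape=shape)
    # In-order semantics: the kernel below cannot start before this copy ends.
    cl.enqueue_copy(queue, d_image, h_data, origin=(0, 0, 0), region=shape)
    prg.sum(queue, shape, None, d_image)
    queue.finish()
```

The helper encodes the point made above: with a single in-order queue there is nothing to synchronize against, so wait_for can be dropped entirely.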