It has been a while and I just wanted to let you know my final findings:
I tested the script below on different implementations and saw strange
things in NVVP. So I decided to implement your pyopencl event system for
in-depth profiling and can now confirm that the problem is not in pyopencl
but in NVVP's display. I extensively tested this on several driver-hardware
combinations and it seems to hold true on all of them.
List of test scenarios (driver version / device) for completeness below:
367.35 'Tesla K10.G2.8GB'
367.44 'GeForce GTX TITAN'
367.35 'GeForce GTX TITAN X'
367.35 'GeForce GTX TITAN'
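
For reference, here is a minimal sketch of how such an event-based
cross-check can be set up in pyopencl. This is not the original script;
the kernel, buffer, and sizes are illustrative stand-ins for the actual
image pipeline. The idea is simply to create both queues with profiling
enabled and compare the copy's end timestamp with the kernel's start
timestamp.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
props = cl.command_queue_properties.PROFILING_ENABLE
queue = cl.CommandQueue(ctx, properties=props)
queue2 = cl.CommandQueue(ctx, properties=props)

# stand-in kernel; the real Kernel.cl is not shown in this thread
prg = cl.Program(ctx, """
__kernel void sum(__global float *buf) {
    int i = get_global_id(0);
    buf[i] = buf[i] + 1.0f;
}
""").build()

h_data = np.zeros(1 << 20, dtype=np.float32)
d_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=h_data.nbytes)

copy_ev = cl.enqueue_copy(queue, d_buf, h_data, is_blocking=False)
kernel_ev = prg.sum(queue2, h_data.shape, None, d_buf, wait_for=[copy_ev])
kernel_ev.wait()

# if the wait list is honoured, the kernel cannot start before the copy ends
print("copy end    :", copy_ev.profile.end)
print("kernel start:", kernel_ev.profile.start)
print("gap (ns)    :", kernel_ev.profile.start - copy_ev.profile.end)

A consistently non-negative gap in these timestamps, even though NVVP
draws the kernel as starting earlier, is what points to the display
rather than to pyopencl.
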
On 2016-04-05 21:59, Andreas Kloeckner wrote:
"Schock, Jonathan" <jonathan.schock(a)tum.de> writes:
The important bits that seem to cause the behaviour:
import pyopencl as cl
import numpy as np

platform = cl.get_platforms()[0]
devs = platform.get_devices()
device1 = devs[0]
ctx = cl.Context([device1])
queue = cl.CommandQueue(ctx)
queue2 = cl.CommandQueue(ctx)
mf = cl.mem_flags

with open('Kernel.cl', 'r') as f:
    fstr = f.read()
prg = cl.Program(ctx, fstr).build()

# host image (shape, dtype and format reconstructed; the originals were
# trimmed from the quoted mail)
h_image_shape = (1024, 1024)
h_data = np.zeros(h_image_shape, dtype=np.float32)
fmt = cl.ImageFormat(cl.channel_order.R, cl.channel_type.FLOAT)

d_image = cl.Image(ctx, mf.READ_ONLY, fmt, shape=h_image_shape)
wev1 = cl.enqueue_copy(queue, d_image, h_data, is_blocking=False,
                       origin=(0, 0), region=h_image_shape)
prg.sum(queue2, h_image_shape, None, d_image, wait_for=[wev1])
The Kernel is doing some simple number crunching on the input image.
I'm measuring with nvvp; one result is attached, where you can clearly see
that the kernel launches long before the copy has ended.
I already tested with OoO disabled... same behaviour.
Implementation is 'Tesla K10.G2.8GB' on 'NVIDIA CUDA'.
For completeness: What driver version is this?
I am pretty certain that what you're seeing is non-compliant behavior of
the Nv implementation. It is consistent with Nv just ignoring the
event_wait_list.
Beyond that, using two queues *and* ooo is redundant. ooo alone will
already allow concurrency.
I would probably recommend using a single queue until overlapping
execution is actually needed.
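
To illustrate that point (this sketch is not from the original mail, and
the kernel and buffer are again illustrative): a single out-of-order
queue, with the ordering expressed only through the wait list instead of
a second queue, would look roughly like this.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
props = cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE
queue = cl.CommandQueue(ctx, properties=props)  # one queue instead of two

prg = cl.Program(ctx, """
__kernel void sum(__global float *buf) {
    int i = get_global_id(0);
    buf[i] += 1.0f;
}
""").build()

h_data = np.zeros(1 << 20, dtype=np.float32)
d_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=h_data.nbytes)

# the wait list alone orders the kernel after the copy; unrelated commands
# enqueued on this queue would still be free to run concurrently
copy_ev = cl.enqueue_copy(queue, d_buf, h_data, is_blocking=False)
kernel_ev = prg.sum(queue, h_data.shape, None, d_buf, wait_for=[copy_ev])
kernel_ev.wait()

With a plain in-order queue the wait_for would not even be necessary for
this particular pair of commands, since queue order already serializes
them.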