On Thu, 3 Mar 2011 16:50:29 +0100, Magnus Paulsson <paulsson.m(a)gmail.com> wrote:
I have updated the code to reuse allocated memory, both on the device
and in host page-locked memory (since I think those allocations were
the cause of the blocking calls).
The streamed version is now faster than the "serial" code. However, I
still think that the speed increase is simply due to faster mem-copies
(from/to page-locked memory) and not from any overlap between stream 1
and 2. At least if I trust the visual profiler.
So ... am I doing things wrong?
Looking at your streams.py code, I'm wondering why you're expecting
things to run in parallel if you're synchronizing with both stream1 and
stream2 after you're done with each of them? Wouldn't that explicitly
prevent any parallelism between them?
What am I missing?
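
To make the concern concrete, here is a toy timeline model in plain Python. It is only a sketch and does not touch the GPU or PyCUDA. It assumes a device with one engine per kind of operation (host-to-device copy, kernel, device-to-host copy), made-up durations, and zero launch overhead; the names makespan and work are hypothetical helpers, not anything from streams.py. It compares synchronizing right after each stream's work with synchronizing only once both streams have been queued.

```python
def makespan(schedule):
    """Finish time of a host-issued schedule under the toy model.

    Entries are ("op", stream, engine, duration) or ("sync", stream).
    An op starts once the host has issued it, its stream is idle,
    and its engine is free. A sync blocks the host until the stream
    has drained. Assumption: issuing an op takes zero host time.
    """
    host = 0.0
    stream_ready = {}
    engine_free = {}
    for entry in schedule:
        if entry[0] == "sync":
            # The host waits here, so nothing later can be issued earlier.
            host = max(host, stream_ready.get(entry[1], 0.0))
        else:
            _, stream, engine, duration = entry
            start = max(host,
                        stream_ready.get(stream, 0.0),
                        engine_free.get(engine, 0.0))
            stream_ready[stream] = engine_free[engine] = start + duration
    return max(stream_ready.values(), default=0.0)


def work(stream):
    # Made-up durations: copy in, run kernel, copy out.
    return [("op", stream, "h2d", 1.0),
            ("op", stream, "kernel", 2.0),
            ("op", stream, "d2h", 1.0)]


# Synchronize after each stream is done: stream 2 cannot start early.
blocking = work(1) + [("sync", 1)] + work(2) + [("sync", 2)]

# Queue both streams first, synchronize only at the very end.
deferred = work(1) + work(2) + [("sync", 1), ("sync", 2)]

print("sync after each stream:", makespan(blocking))  # 8.0
print("sync at the end:       ", makespan(deferred))  # 6.0
```

In this model the first variant takes exactly as long as running the two streams back to back, while the second lets stream 2's upload run under stream 1's kernel. Whether real hardware overlaps in the same way depends on the device and on the order in which work is issued, so treat the numbers as illustrative only.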