Hi Bogdan,
On Fri, 27 Jan 2012 13:27:04 +1100, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
Hello,
As far as I understand, this is somehow connected with pycuda::complex
having default constructor. When you write (in reduction kernel):
__shared__ out_type sdata[BLOCK_SIZE];
and "out_type" has the default constructor, it is called from _every_
thread for _every_ value of the array. The reduction kernel starts
like:
__shared__ out_type sdata[BLOCK_SIZE];
... collecting values ...
sdata[tid] = <some value>
which means that there is a race between the default constructor and
actual initialization. I am not sure why nvcc does not complain about
this (bug?), but the following things do help:
1. Using "extern __shared__ out_type sdata[]" and setting the size of
shared memory when preparing the kernel.
or
2. Putting "__syncthreads()" between default initialization and actual
initialization (not very good, since it leaves all those calls to
default constructor, but still removes the symptom).
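For illustration, option 1 might look roughly like the sketch below. It is not the actual PyCUDA reduction kernel: the names reduce_sum_extern and run_reduce_sum_extern are hypothetical, and float2 stands in for out_type as a simple complex-like type. The shared array is declared without a size, and the byte count is passed as the third launch parameter.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: float2 stands in for out_type (e.g. a complex number).
typedef float2 out_type;

// Hypothetical single-purpose sum reduction; blockDim.x must be a power of two.
__global__ void reduce_sum_extern(const out_type *in, out_type *out, unsigned int n)
{
    // Unsized declaration: the size is supplied at launch time.
    extern __shared__ out_type sdata[];

    unsigned int tid = threadIdx.x;

    // Each thread collects a partial sum with a grid-stride loop.
    out_type acc = make_float2(0.0f, 0.0f);
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x) {
        acc.x += in[i].x;
        acc.y += in[i].y;
    }
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction within the block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid].x += sdata[tid + s].x;
            sdata[tid].y += sdata[tid + s].y;
        }
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: sums n copies of (1, 2) using one block.
// Error checking is omitted to keep the sketch short.
out_type run_reduce_sum_extern(unsigned int n)
{
    std::vector<out_type> h_in(n, make_float2(1.0f, 2.0f));
    out_type h_out = make_float2(0.0f, 0.0f);
    out_type *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(out_type));
    cudaMalloc(&d_out, sizeof(out_type));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(out_type), cudaMemcpyHostToDevice);

    const unsigned int block = 256;
    // Third launch parameter: bytes of dynamic shared memory per block.
    reduce_sum_extern<<<1, block, block * sizeof(out_type)>>>(d_in, d_out, n);

    cudaMemcpy(&h_out, d_out, sizeof(out_type), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The shared-memory size now has to be kept in step with the block size on the host side, which is the price of dropping the fixed-size declaration.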
I googled a bit, but could not find any actual rules about default
constructors and shared memory.
Thanks for this insight! I don't think this would've occurred to me just
from staring at the code. :) Indeed, inserting __syncthreads() after the
shared array declaration brings the error down to more reasonable values
for me. Jesse, my recommendation would be to use that as a workaround
while we figure out a more permanent fix.
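To make the placement of the workaround concrete, here is a rough sketch, not the real reduction kernel. The names reduce_sum_static and run_reduce_sum_static are hypothetical, and float2 stands in for out_type. Since float2 has no user-written default constructor, the first barrier here only marks where the call would go for a type that does have one.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: float2 stands in for out_type.
typedef float2 out_type;

#define BLOCK_SIZE 128  // must be a power of two and match the launch

// Hypothetical single-block sum reduction with a fixed-size shared array.
__global__ void reduce_sum_static(const out_type *in, out_type *out, unsigned int n)
{
    __shared__ out_type sdata[BLOCK_SIZE];
    // Workaround: barrier right after the declaration, so that any
    // per-thread default construction of sdata is finished before
    // a thread stores its real value.
    __syncthreads();

    unsigned int tid = threadIdx.x;

    // Each thread collects a partial sum.
    out_type acc = make_float2(0.0f, 0.0f);
    for (unsigned int i = tid; i < n; i += BLOCK_SIZE) {
        acc.x += in[i].x;
        acc.y += in[i].y;
    }
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction within the block.
    for (unsigned int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid].x += sdata[tid + s].x;
            sdata[tid].y += sdata[tid + s].y;
        }
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

// Hypothetical host helper: sums n copies of (1, -1) using one block.
// Error checking is omitted to keep the sketch short.
out_type run_reduce_sum_static(unsigned int n)
{
    std::vector<out_type> h_in(n, make_float2(1.0f, -1.0f));
    out_type h_out = make_float2(0.0f, 0.0f);
    out_type *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(out_type));
    cudaMalloc(&d_out, sizeof(out_type));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(out_type), cudaMemcpyHostToDevice);

    reduce_sum_static<<<1, BLOCK_SIZE>>>(d_in, d_out, n);

    cudaMemcpy(&h_out, d_out, sizeof(out_type), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The extra barrier costs one synchronization per block and leaves the redundant constructor calls in place, so it is a stopgap and not a fix.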
I just searched as well, but couldn't find anything. I've pinged Nathan
Bell at Nvidia (coauthor of thrust) to see what he thinks about this.
Andreas