Hello,
As far as I understand, this is somehow connected with pycuda::complex
having a default constructor. When you write (in a reduction kernel):
__shared__ out_type sdata[BLOCK_SIZE];
and "out_type" has a default constructor, that constructor is called by
_every_ thread for _every_ element of the array.
thread for _every_ value of the array. The reduction kernel starts
like:
__shared__ out_type sdata[BLOCK_SIZE];
... collecting values ...
sdata[tid] = <some value>
which means that there is a race between the default constructor and the
actual initialization: a value a thread has just written to sdata[tid] can
be overwritten by another thread that is still running the element
constructors. I am not sure why nvcc does not complain about this (bug?),
but the following things do help:
1. Using "extern __shared__ out_type sdata[]" and setting the size of
shared memory when preparing the kernel,
or
2. Putting "__syncthreads()" between the implicit default construction and
the actual initialization (not ideal, since it leaves all those calls to
the default constructor in place, but it does remove the symptom).
Both are sketched below.
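For concreteness, here is a minimal, self-contained sketch of what the two
workarounds look like in a plain CUDA reduction kernel. Everything in it is
illustrative: "out_type" is a hypothetical stand-in for
pycuda::complex<float> (or any type with a user-provided default
constructor), BLOCK_SIZE and the kernel bodies are made up for the example,
and this is not the kernel PyCUDA itself generates. Whether nvcc actually
emits the per-thread constructor calls for the statically sized array (and
hence whether the barrier is needed) may also depend on the toolkit version.

// Illustrative only: "out_type" stands in for pycuda::complex<float> or any
// type with a user-provided default constructor.
#define BLOCK_SIZE 256   // assumed equal to blockDim.x below

struct out_type
{
    float re, im;
    __device__ out_type() : re(0.f), im(0.f) { }       // non-trivial default constructor
    __device__ out_type(float r, float i) : re(r), im(i) { }
};

__device__ out_type operator+(out_type a, out_type b)
{
    return out_type(a.re + b.re, a.im + b.im);
}

// Workaround 2: keep the statically sized array, but put a barrier between
// the implicit per-element construction and the first real write.
__global__ void reduce_static(out_type *out, const out_type *in, unsigned n)
{
    __shared__ out_type sdata[BLOCK_SIZE];  // default constructors may run here
    __syncthreads();                        // wait until construction has finished everywhere

    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : out_type();   // pad out-of-range slots with zero
    __syncthreads();

    // standard tree reduction in shared memory
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] = sdata[tid] + sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Workaround 1: dynamically sized shared memory. No per-element default
// construction is emitted; the size comes from the launch configuration.
__global__ void reduce_dynamic(out_type *out, const out_type *in, unsigned n)
{
    extern __shared__ out_type sdata[];

    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : out_type();
    __syncthreads();

    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] = sdata[tid] + sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Host side for workaround 1: the third launch parameter is the shared-memory
// size in bytes, e.g.
//   reduce_dynamic<<<num_blocks, BLOCK_SIZE,
//                    BLOCK_SIZE * sizeof(out_type)>>>(d_out, d_in, n);

The dynamically sized variant is the cleaner of the two, since it avoids the
per-element constructor calls entirely instead of just fencing them off; with
PyCUDA, the shared-memory size would be supplied when preparing or calling
the kernel, as in point 1 above.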
I googled a bit, but could not find any actual rules about default
constructors and shared memory.
Best regards,
Bogdan
On Fri, Jan 27, 2012 at 4:19 AM, Andreas Kloeckner
<lists(a)informa.tiker.net> wrote:
Hi Jesse,
On Wed, 25 Jan 2012 14:51:05 -0800, Jesse Lu <jesselu(a)stanford.edu> wrote:
The attached script shows highly inconsistent results (> 10% error at
times) between the numpy and gpuarray dot products. The inconsistent
results seem to only appear for large gpuarrays of data type complex64 or
complex128. Any ideas on what's going on? Thanks!
I can reproduce the issue, and I'll try to see what's behind
it. Unfortunately, I have a lot of stuff to do at the moment, so I can't
give you an ETA. Of course, I'd also appreciate any help in getting this
debugged--from anyone on the list! :)
Andreas