On Mon, 21 Feb 2011 22:27:04 +0530, nithin s <nithin19484(a)gmail.com> wrote:
I believe there are some errors in the
basing my comments only on the exclusive version.
The final call to finish adds each of the partial sums to every
element of the result. That is to say, if my array size were
1024x1024 and each thread block worked on 1024 elements, my partial
sum array would have 1024 entries, and the last (or second to last)
block would have to iterate over 1024 sums to produce the result.
Isn't this wrong? Shouldn't the partial sums be prefix scanned, with
each block then adding its associated partial sum output to each of
its elements?
That way the loop for (int i = 1; i <= blockIdx.x; i++) is
We know it's broken at the moment--that's why it's currently living on a
branch and not in mainline PyCUDA yet. Patches welcome.
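
For reference, the scheme proposed in the quoted message (scan each block, prefix scan the per-block sums, then have each block add one offset) can be sketched on the host in plain Python. This is only an illustration of the idea, not the code on the branch; the names exclusive_scan and blocked_exclusive_scan are made up for this example, and each "block" here is just a list slice rather than a CUDA thread block.

```python
def exclusive_scan(values):
    """Sequential exclusive prefix sum: out[i] == sum(values[:i])."""
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def blocked_exclusive_scan(data, block_size):
    """Three-phase exclusive scan that emulates per-block processing."""
    # Phase 1: every block scans its own chunk and records its total.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    local_scans = [exclusive_scan(b) for b in blocks]
    block_totals = [sum(b) for b in blocks]

    # Phase 2: an exclusive scan of the per-block totals yields one
    # offset per block (the sum of all elements in earlier blocks).
    block_offsets = exclusive_scan(block_totals)

    # Phase 3: each block adds its single offset to its elements, so no
    # block has to loop over all of the preceding partial sums.
    result = []
    for offset, scanned in zip(block_offsets, local_scans):
        result.extend(offset + x for x in scanned)
    return result
```

With this layout, the work per block in the last phase stays constant per element, whatever the block index is.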