I believe there are some errors in the implementation. I'm
basing my comments only on the exclusive version.
The final call to finish adds each of the preceding partial sums to
every element of the result. That is to say, if my array size were
1024x1024 and each thread block worked on 1024 elements, my partial
sum array would have 1024 entries, and the last (or second to last)
block would have to iterate over 1024 sums to produce its result.
Isn't this wrong? Shouldn't the partial sums be prefix scanned
first, so that each block then adds its associated partial-sum output
to each of its elements? That way the loop
for (int i = 1; i <= blockIdx.x; i++) is no longer needed.
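
To make the suggestion concrete, here is a rough sketch of what I have in mind. It is not the code from the implementation under discussion: the names scanTile, addOffsets, exclusiveScan and BLOCK are all made up for illustration, the per-block scan is a plain Hillis-Steele scan chosen for brevity, and I assume one element per thread and at most BLOCK blocks so that the partial sums fit in a single block. Error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int BLOCK = 1024;  // hypothetical: threads (and elements) per block

// Exclusive scan of one BLOCK-sized tile per thread block.
// If blockSums is non-null, each tile's total is also stored there.
__global__ void scanTile(const int* in, int* out, int* blockSums, int n) {
    __shared__ int tmp[BLOCK];
    int tid = threadIdx.x;
    int gid = blockIdx.x * BLOCK + tid;
    int x = (gid < n) ? in[gid] : 0;  // pad the last tile with zeros
    tmp[tid] = x;
    __syncthreads();
    // Hillis-Steele inclusive scan in shared memory (simple, not work-efficient).
    for (int offset = 1; offset < BLOCK; offset <<= 1) {
        int v = (tid >= offset) ? tmp[tid - offset] : 0;
        __syncthreads();
        tmp[tid] += v;
        __syncthreads();
    }
    if (gid < n) out[gid] = tmp[tid] - x;  // inclusive -> exclusive
    if (blockSums != nullptr && tid == BLOCK - 1) blockSums[blockIdx.x] = tmp[tid];
}

// Each block adds ONE precomputed offset instead of looping over blockIdx.x sums.
__global__ void addOffsets(int* data, const int* blockOffsets, int n) {
    int gid = blockIdx.x * BLOCK + threadIdx.x;
    if (gid < n) data[gid] += blockOffsets[blockIdx.x];
}

// Host driver: scan tiles, scan the partial sums, then add the offsets.
std::vector<int> exclusiveScan(const std::vector<int>& hIn) {
    int n = static_cast<int>(hIn.size());
    if (n == 0) return {};
    int numBlocks = (n + BLOCK - 1) / BLOCK;
    assert(numBlocks <= BLOCK);  // assumption of this sketch: one level of sums

    int *dIn = nullptr, *dOut = nullptr, *dSums = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMalloc(&dSums, numBlocks * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scanTile<<<numBlocks, BLOCK>>>(dIn, dOut, dSums, n);        // per-block scans
    scanTile<<<1, BLOCK>>>(dSums, dSums, nullptr, numBlocks);   // scan partial sums
    addOffsets<<<numBlocks, BLOCK>>>(dOut, dSums, n);           // one add per element

    std::vector<int> hOut(n);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dSums);
    return hOut;
}
```

With this layout the last block does a single addition per element, rather than summing up to 1024 partial sums first. For the 1024x1024 case the 1024 partial sums fit exactly in one block; larger inputs would need the partial-sum scan applied recursively.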
PS: Please feel free to ignore this if it has already been noted. Do
let me know, though. :)