I am attaching an updated version.
@Andreas: the floating-point operations are no longer part of the code.
I also did some code beautification and made the nomenclature consistent.
On 22 February 2011 15:17, nithin s <nithin19484(a)gmail.com> wrote:
On 22 February 2011 05:45, Andreas Kloeckner <lists(a)informa.tiker.net> wrote:
- can you please resend this as an attachment? It's hard to fish out of
the text of an email.
- please avoid using floating point functions (log, ceil, floor) in
integer contexts. PyCUDA comes with a bitlog2 function that does what
you need, I think.
bitlog2 alone doesn't cut it. This is because the log is taken to the
base 2*block_size, and block_size need not be a power of 2 in a few rare
cases: if shared memory is limited, then block_size =
shared_mem/item_size, and item_size need not be a power of 2 (if we
are willing to support arbitrary types, though there is a limitation,
since the dtype, presumably a numpy.dtype, needs to be known for the
partial sum array allocations).
This will mess up the estimate. I could recode this by writing a
routine that repeatedly divides to calculate the necessary integer ceil.
I feel the current expression is cleaner and more concise. Let me know if
you still feel otherwise.
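
The repeated-division routine mentioned above could look roughly like the
sketch below. It is only an illustration, not the code in the attachment:
the name ceil_log and the sample numbers (block_size = 96, one million
elements) are made up for the example. It returns the smallest integer k
with base**k >= n using integer arithmetic only, so it works when
2*block_size is not a power of 2.

```python
def ceil_log(n, base):
    """Smallest integer k >= 0 such that base**k >= n (integers only).

    Hypothetical helper for illustration; not part of PyCUDA.
    """
    if n < 1 or base < 2:
        raise ValueError("need n >= 1 and base >= 2")
    k = 0
    while n > 1:
        # Integer ceiling of n / base, without going through floats.
        n = (n + base - 1) // base
        k += 1
    return k


# Example: block_size = 96 is not a power of 2, so the base is 192.
block_size = 96
levels = ceil_log(1000000, 2 * block_size)
```

Here levels comes out as 3, since 192**2 = 36864 is below one million and
192**3 is above it. The loop runs once per level, so the cost is negligible
next to the scan itself.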
Once I get the file posted on the PyCUDA branch, I'll write a more
complete review. I agree with your assessment of inclusive vs exclusive
scan. I'd say go ahead and kill the inclusive version.
Tomasz, what's your take here?
@Bryan: Tomasz's original inclusive scan implementation was based on the
naive prefix scan algorithm at http://en.wikipedia.org/wiki/Prefix_sum,
which is not particularly work efficient. I don't (yet) see a neat way to
convert the exclusive Mark Harris version to an inclusive one. Thus I
thought it better to maintain a single exclusive prefix scan.
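
To make the work-efficiency point concrete, here is a host-side Python
sketch of the two schemes that counts additions. It is an illustration
only, not the CUDA kernels under discussion: the function names are
invented, the naive version follows the doubling-offset scheme, and the
exclusive version is the up-sweep/down-sweep scheme as I understand it,
restricted here to power-of-2 lengths.

```python
def naive_inclusive_scan(xs):
    """Doubling-offset scan: about n*log2(n) additions in total."""
    out = list(xs)
    adds = 0
    offset = 1
    while offset < len(out):
        prev = list(out)  # each pass reads the previous pass's values
        for i in range(offset, len(out)):
            out[i] = prev[i - offset] + prev[i]
            adds += 1
        offset *= 2
    return out, adds


def work_efficient_exclusive_scan(xs):
    """Up-sweep/down-sweep scan: about 2*n additions in total.

    This sketch assumes len(xs) is a power of 2.
    """
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2 in this sketch")
    out = list(xs)
    adds = 0
    # Up-sweep: build partial sums in place, tree fashion.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            out[i] += out[i - d]
            adds += 1
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    out[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = out[i - d]
            out[i - d] = out[i]
            out[i] += left
            adds += 1
        d //= 2
    return out, adds


data = [3, 1, 7, 0, 4, 1, 6, 3]
inclusive, naive_adds = naive_inclusive_scan(data)
exclusive, efficient_adds = work_efficient_exclusive_scan(data)
```

For these 8 elements the naive version does 17 additions and the
up-sweep/down-sweep version does 14. The gap widens with n, since one
grows like n*log2(n) and the other like 2*n.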