Wow, that's quite a bit of work. Though I must admit, on first
inspection of the kernel code it was evident that as the kernel goes
up and down the tree there is a great deal of waiting.
Anyhow, I don't quite have the energy to get another version working
in the near future. I'm leaving it to the community to use the code or
@Andreas: I've attached an improved version with a few bug fixes.
On 22 February 2011 23:24, Bryan Catanzaro <catanzar(a)eecs.berkeley.edu> wrote:
There has been lots of work on parallel prefix sum on GPUs. The
difference between implementations of inclusive and exclusive prefix
sum is very minor, and both can be implemented efficiently.
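For concreteness, the two variants differ only by a shift of the output. A minimal sequential Python sketch (illustration only, not how a GPU kernel would be written):

```python
def inclusive_scan(xs, op=lambda a, b: a + b):
    # Element i of the output includes xs[i] itself.
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def exclusive_scan(xs, identity=0, op=lambda a, b: a + b):
    # Element i of the output is the sum of xs[0..i-1]; xs[i] is excluded.
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out
```

Note the relationship: the exclusive scan is the inclusive scan shifted right by one, with the identity prepended, which is why converting between them is cheap.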
Focusing on work efficiency for GPUs is actually a distracting red
herring, especially when you're using a fully recursive parallel
prefix sum, as you are doing. This is for two reasons:
1. GPUs are SIMD machines, so work efficiency as counted by the PRAM
model does not always correlate to saving work on the actual hardware.
2. The work involved in synchronizing and coordinating threads is not
counted as "work" in traditional analyses. On GPUs, floating-point
operations are essentially free. Communication and coordination
between parallel threads, however, is very expensive. Instead of
doing a fully parallel recursive parallel prefix sum, it's much more
efficient to sequentialize the sum as much as possible, expressing
only enough parallelism to fill the chip.
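A rough sketch of that idea in Python (the names and the chunking scheme are my own; on a GPU each chunk would be one block's sequential work, and only the final combine involves coordination):

```python
def two_level_sum(xs, p=4):
    # Each of p "processors" sums its chunk sequentially, with no
    # communication at all; a single small combine then merges the
    # p partial sums. Only the combine step needs coordination.
    chunk = (len(xs) + p - 1) // p
    partials = [sum(xs[i * chunk:(i + 1) * chunk]) for i in range(p)]
    return sum(partials)
```

The point is that p is chosen to fill the chip, not to match n; everything beyond that parallelism is done sequentially and costs no synchronization.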
It's also worth taking advantage of commutativity - have you
defined whether your scan implementation accepts associative but
non-commutative operators? If you require operators to be both
associative and commutative, you can improve performance
significantly. I wrote up some things about reductions (which are
related to scan) here:
As you can see, the "2-phase" reduction kernels are much faster than
the "recursive multi-phase" kernels. Your implementation uses the
recursive multi-phase approach. There are two analogues of the
2-phase reduction kernel for scan on an n element array with p
processors (thread blocks), both of which require exactly 3 kernel
launches, regardless of n or p:
* Reduce, Scan, Scan: Divide the input into blocks of n/p elements.
Perform reductions on these blocks in parallel, write out the sum of
each block to a temporary array. Perform a scan on the temporary
array (this should only require 1 thread block and one kernel, since
there are only p elements in the temporary array). Perform a scan on
the input, this time starting each block with the appropriate result
from the scan.
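A sequential Python model of the reduce-scan-scan scheme, with each of the three stages standing in for one kernel launch (the names are illustrative, not PyCUDA API):

```python
def reduce_scan_scan(xs, p=4):
    chunk = (len(xs) + p - 1) // p
    blocks = [xs[i * chunk:(i + 1) * chunk] for i in range(p)]
    # "Kernel 1": each of p blocks reduces its chunk to one partial sum.
    sums = [sum(b) for b in blocks]
    # "Kernel 2": exclusive scan of the p partial sums (one small block).
    offsets, acc = [], 0
    for s in sums:
        offsets.append(acc)
        acc += s
    # "Kernel 3": scan each block, seeded with that block's offset.
    out = []
    for b, off in zip(blocks, offsets):
        acc = off
        for x in b:
            acc += x
            out.append(acc)
    return out
```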
* Scan, Scan, Add: Scan the blocks in parallel, write the results out.
Perform a scan on the first result from each block (this should only
require 1 thread block and one kernel launch, since there are only p
blocks). Broadcast the appropriate result from the partial scan and
perform vector addition in each block.
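And a matching Python model of scan-scan-add (again, each stage stands in for one kernel launch; the names are mine):

```python
def scan_scan_add(xs, p=4):
    chunk = (len(xs) + p - 1) // p
    blocks = [xs[i * chunk:(i + 1) * chunk] for i in range(p)]
    # "Kernel 1": inclusive scan of each block independently.
    scanned = []
    for b in blocks:
        acc, s = 0, []
        for x in b:
            acc += x
            s.append(acc)
        scanned.append(s)
    # "Kernel 2": exclusive scan over each block's total, i.e. the last
    # element of each per-block scan (one small block).
    offsets, acc = [], 0
    for s in scanned:
        offsets.append(acc)
        acc += s[-1] if s else 0
    # "Kernel 3": broadcast each block's offset and do a vector add.
    return [v + off for s, off in zip(scanned, offsets) for v in s]
```

The trade-off versus reduce-scan-scan: this variant scans the input only once but writes intermediate results out and reads them back for the add.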
If you sequentialize these operations as much as possible, and do
parallel reduction/scan trees only when absolutely necessary, you will
save a lot of synchronization and communication cost. You can take a
look at Thrust to see an example of how this can be done.
Best of luck,
 Thrust: http://code.google.com/p/thrust
On Tue, Feb 22, 2011 at 1:47 AM, nithin s <nithin19484(a)gmail.com> wrote:
On 22 February 2011 05:45, Andreas Kloeckner <lists(a)informa.tiker.net> wrote:
- can you please resend this as an attachment? It's hard to fish out of
the text of an email.
- please avoid using floating point functions (log, ceil, floor) in
integer contexts. PyCUDA comes with a bitlog2 function that does what
you need, I think.
bitlog2 alone doesn't cut it. This is because the log is taken to the
base 2*block_size, and block_size need not be a power of 2 in a few
rare cases: if shared memory is limited, then block_size =
shared_mem/item_size, and the item size need not be a power of 2 (if
we are willing to support arbitrary types - though there is a
limitation, since the dtype needs to be known for the partial-sum
array allocations, and is presumably a numpy.dtype).
This will mess up the estimate. I could recode this as a routine that
repeatedly divides and computes the necessary integer ceiling. I feel
the current expression is cleaner and more concise. Let me know if
you still feel otherwise.
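For what it's worth, the repeated-division routine is short. A hypothetical sketch (the name and signature are mine, not PyCUDA API):

```python
def int_ceil_log(n, base):
    # ceil(log_base(n)) for n >= 1, base >= 2, using only integer
    # arithmetic - no floating-point log/ceil. Works for any base,
    # e.g. base = 2*block_size where block_size is not a power of 2.
    assert n >= 1 and base >= 2
    levels, span = 0, 1
    while span < n:
        span *= base
        levels += 1
    return levels
```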
Once I get the file posted on the PyCUDA branch, I'll write a more
complete review. I agree with your assessment of inclusive vs exclusive
scan. I'd say go ahead and kill the inclusive version.
Tomasz, what's your take here?
@Bryan: Tomasz's original inclusive scan implementation was based on
the naive prefix scan algorithm at
http://en.wikipedia.org/wiki/Prefix_sum, which is not particularly
work efficient. I don't (yet) see a neat way to convert the exclusive
Mark Harris version to an inclusive one. Thus I thought it better to
maintain a single exclusive prefix scan.
PyCUDA mailing list