Bryan Catanzaro <bcatanzaro(a)acm.org> writes:
> I agree that data size matters in these discussions. But I think the
> right way to account for it is to show performance at a range of data
> sizes, as measured from Python.
>
> The assumption that you'll keep the GPU busy isn't necessarily true.
> thrust::reduce, for example (which max_element uses internally),
> launches a big kernel, followed by a small kernel to finish the
> reduction tree, followed by a cudaMemcpy to transfer the result back
> to the host. The GPU won't be busy during the small kernel, nor
> during the cudaMemcpy, nor during the conversion back to Python, etc.
>
> Reduce is often used to make control flow decisions in optimization
> loops, where you don't know what the next optimization step to be
> performed is until the result is known, and so you can't launch the
> work speculatively. If the control flow is performed in Python, all
> these overheads are exposed to application performance - so I think
Glad you brought that up. :) The conjugate gradient solver in PyCUDA
addresses exactly that by simply running iterations as fast as it can
and shepherding the residual results to the host on their own time,
deferring convergence decisions until the data is available. That was
good for a 20% or so gain last time I measured it (on a GT200).
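The deferred-convergence trick is easy to illustrate in plain Python with the GPU abstracted away: keep issuing iterations eagerly, let residual norms trickle back "on their own time", and only act on a residual that is a few iterations stale rather than stalling the pipeline every step. A minimal CPU-only sketch of that pattern (all names here are hypothetical, not PyCUDA's actual API):

```python
# CPU-only sketch of deferred convergence checking. The deque stands in
# for residual transfers that are "in flight" from device to host; a
# residual only becomes visible `lag` iterations after it was produced.
from collections import deque

def solve(step, residual_norm, x0, tol=1e-8, lag=3, max_iter=1000):
    """Iterate `step` eagerly; test convergence against residuals that
    are `lag` iterations old, mimicking asynchronous readback."""
    x = x0
    pending = deque()                  # residuals not yet "arrived" on the host
    for it in range(max_iter):
        x = step(x)                    # launch the next iteration immediately
        pending.append(residual_norm(x))
        if len(pending) > lag:         # oldest transfer has arrived by now
            if pending.popleft() < tol:
                return x, it + 1       # may overshoot by up to `lag` steps
    return x, max_iter

# Toy fixed-point iteration x <- x/2 with residual |x|: converges, but
# the stop decision lands `lag` iterations after the residual crossed tol.
x, n = solve(lambda x: x / 2.0, abs, x0=1.0, tol=1e-6, lag=3)
```

The cost of the pattern is a handful of extra iterations past the convergence point; the gain is that the host never blocks the iteration pipeline waiting on a reduction result, which is where the measured speedup comes from.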