I agree that data size matters in these discussions. But I think the
right way to account for it is to show performance at a range of data
sizes, as measured from Python.
The assumption that you'll keep the GPU busy isn't necessarily true.
thrust::reduce, for example (which max_element uses internally),
launches a big kernel, followed by a small kernel to finish the
reduction tree, followed by a cudaMemcpy to transfer the result back
to the host. The GPU won't be busy during the small kernel, nor
during the cudaMemcpy, nor during the conversion back to Python, etc.
Reduce is often used to make control flow decisions in optimization
loops, where you don't know which optimization step to perform next
until the result is known, and so you can't launch the work
speculatively. If the control flow is performed in Python, all
these overheads are exposed to application performance, so I think
the fact that they're relatively less important for larger problems
will be evident if the timings are made from the Python side.
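As a rough sketch of what "timings made from the Python side" could look like: the harness below times a call end to end from the caller, at a range of sizes. The helper name time_from_python and the builtin max over a list are stand-ins chosen for illustration, not anything from PyCUDA or Thrust; with a GPU library one would pass a function that returns the result as a host-side Python value, so that launch, transfer, and conversion costs are all inside the timed region.

```python
import time


def time_from_python(fn, make_input, sizes, repeats=5):
    """Return {size: best wall-clock seconds} for fn(make_input(size)).

    Input construction is excluded; everything the caller waits for
    inside fn (including any result conversion) is included.
    """
    results = {}
    for n in sizes:
        data = make_input(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(data)
            elapsed = time.perf_counter() - start
            best = min(best, elapsed)
        results[n] = best
    return results


# Host-only stand-in for a reduction: builtin max over a Python list.
sizes = [10 ** k for k in range(1, 6)]
timings = time_from_python(max, lambda n: list(range(n)), sizes)

for n, t in timings.items():
    # Report both total time and time per element, so fixed per-call
    # costs show up as a high per-element cost at small sizes.
    print(f"n={n:>7}  {t * 1e6:10.2f} us  {t / n * 1e9:8.2f} ns/element")
```

Reporting the whole table rather than one number lets a reader see where fixed per-call costs stop dominating.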
On Wed, May 30, 2012 at 10:20 PM, Andreas Kloeckner wrote:
On Wed, 30 May 2012 21:58:13 -0700, Bryan Catanzaro wrote:
Why should the overhead be measured separately? For users of these
systems, the Python overhead is unavoidable. The time spent running
on the GPU alone is an important implementation detail for people
improving systems like PyCUDA, but users of these systems see overhead
costs exposed in their overall application performance, and so I don't
see how the overhead can be ignored.
Because whether the overhead matters or not depends on data size. Since
the overhead is constant across all data sizes, that overhead is going
to be mostly irrelevant for big data, whereas for tiny data it might
well be a dealbreaker.
That's why I think a single number doesn't cut it.
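To put illustrative numbers on this, here is a minimal sketch assuming a simple cost model: total time is a fixed per-call overhead plus a term proportional to data size. The 50 microsecond overhead and the throughput of 1e9 elements per second are made-up values for illustration, not measurements of any system.

```python
def overhead_fraction(n, overhead_s, throughput_elems_per_s):
    """Share of total time spent in fixed overhead, under the assumed
    model: total = overhead_s + n / throughput_elems_per_s."""
    compute_s = n / throughput_elems_per_s
    return overhead_s / (overhead_s + compute_s)


OVERHEAD_S = 50e-6   # hypothetical fixed per-call overhead
THROUGHPUT = 1e9     # hypothetical processing rate, elements per second

for n in (10 ** 3, 10 ** 5, 10 ** 7, 10 ** 9):
    frac = overhead_fraction(n, OVERHEAD_S, THROUGHPUT)
    print(f"n={n:>10}  overhead share = {frac:8.3%}")
```

Under these assumed values the overhead dominates at a thousand elements and is negligible at a billion, which is the sense in which one number cannot summarize both regimes.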
In addition, there's an underlying assumption that you'll keep the GPU
busy for a while, i.e. keep the GPU queue saturated. If you do that (the
ability to do that being related, again, to data size), then on top of
that anything Python does runs in parallel to the GPU--and your net run
time will be exactly the same as if the overhead never happened.
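The two positions in this thread can be compared with a toy timeline model, sketched here under stated assumptions: every step costs the same host time and the same GPU time, the GPU runs one kernel at a time, and the function names and timings are hypothetical. It simulates nothing about a real driver or queue.

```python
def async_total(n_steps, host_s, gpu_s):
    """Host enqueues launches without waiting; the GPU runs kernels
    back to back as soon as each one has been enqueued."""
    host_clock = 0.0   # time at which the host finishes enqueueing
    gpu_free = 0.0     # time at which the GPU finishes its last kernel
    for _ in range(n_steps):
        host_clock += host_s
        start = max(host_clock, gpu_free)
        gpu_free = start + gpu_s
    return gpu_free


def sync_total(n_steps, host_s, gpu_s):
    """Host needs each result before it can choose the next step,
    so host time and GPU time never overlap."""
    return n_steps * (host_s + gpu_s)


HOST_S = 10e-6  # hypothetical Python-side cost per step

for gpu_s in (1e-3, 1e-6):  # a large kernel and a tiny one
    a = async_total(100, HOST_S, gpu_s)
    s = sync_total(100, HOST_S, gpu_s)
    print(f"gpu_s={gpu_s:.0e}  async={a * 1e3:.4f} ms  sync={s * 1e3:.4f} ms")
```

In this model the host cost is hidden only when launches can be queued ahead and each kernel outlasts the host work for the next one. When each step waits on the previous result, or kernels are shorter than the host cost, the overhead shows up in the total.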