Hi Igor -
I meant that it's more useful to know the execution time of code
running on the GPU from Python's perspective, since Python is the one
driving the work, and the execution overheads can be significant.
What timings do you get when you use timeit rather than CUDA events?
Also, what GPU are you running on?
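
A minimal sketch of that kind of measurement, assuming nothing about the worksheet itself: `wall_time` and `argmax_cpu` are hypothetical names, and the workload is a pure-Python stand-in so the snippet runs without a GPU. For GPU code you would pass a function that blocks until the device is idle as `sync` (in PyCUDA, `pycuda.driver.Context.synchronize` should serve), otherwise the clock may stop before the asynchronous kernel has finished.

```python
import timeit


def wall_time(fn, sync=None, repeat=5, number=10):
    """Return the best per-call wall-clock time of fn() in seconds.

    This is the time as seen from Python, so it includes call and
    launch overheads, not only the time spent inside the kernel.
    """
    def run():
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before the clock stops

    run()  # warm-up call: keeps one-time costs (compilation, lazy loading) out
    return min(timeit.repeat(run, repeat=repeat, number=number)) / number


# Stand-in CPU workload (an assumption for illustration, not the code under test).
data = [(i * 37) % 1001 for i in range(100_000)]


def argmax_cpu():
    # Position of the first maximum element.
    return max(range(len(data)), key=data.__getitem__)


seconds = wall_time(argmax_cpu)
print(f"{seconds:.3e} s per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; the warm-up call matters when the first call pays for compilation or for loading a shared library.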
On Wed, May 30, 2012 at 5:56 PM, Igor <rychphd(a)gmail.com> wrote:
I've updated the
larger vector, a billion elements.
As for returning the value: it's the pair of max value and position we
are talking about. Thrust returns the position, and I'm now also timing the
extraction of the value from the GPU array, which didn't change the timing.
ReductionKernel still appears 5 times slower than thrust.
Bryan, on the same worksheet the numpy timing is printed as well:
argmax is 3 times slower than ReductionKernel.
On Thu, May 31, 2012 at 12:08 PM, Andreas Kloeckner
On Wed, 30 May 2012 22:13:27 +1200, Igor
I'm attaching an example for your wiki demonstrating how to find a max
element position both using ReductionKernel and thrust-nvcc-ctypes.
The latter doesn't quite work on Windows yet. It should work if you're on
Linux; just change the FOLDER. There is a live version published on
my sage server (http://dev.math.canterbury.ac.nz/home/pub/26/
there all work and show a discouraging 5-fold slowdown of
ReductionKernel as compared to thrust (run twice, as the .so file is
loaded lazily?). Could you take a look and edit it if necessary?
Not a fair comparison. The PyCUDA test includes the transfer of the
result to the host (.get()). That doesn't look like it's the case for
thrust. Also, an 80 MB vector is tiny. At 200 GB/s, that's about 4e-4 s,
which is in the vicinity of launch overhead.
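
The estimate above is just bytes divided by bandwidth; a short sketch of the arithmetic follows. The 80 MB and 200 GB/s figures come from the message; the helper name `min_read_time` is hypothetical, and the 4-byte element size used for the billion-element case is an assumption, since the thread does not state the dtype.

```python
def min_read_time(n_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time needed to read n_bytes once from memory."""
    return n_bytes / bandwidth_bytes_per_s


bandwidth = 200e9  # 200 GB/s, the figure quoted in the message

# 80 MB vector from the original test.
t_small = min_read_time(80e6, bandwidth)
print(f"80 MB: {t_small:.1e} s")

# A billion elements, ASSUMING 4-byte (float32) elements.
t_large = min_read_time(1e9 * 4, bandwidth)
print(f"1e9 float32 elements: {t_large:.1e} s")
```

When the bound comes out close to the launch overhead, the measurement is dominated by fixed costs rather than by the reduction itself, which is why the larger vector gives a more meaningful comparison.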
PyCUDA mailing list