larger vector, a billion elements.
As for returning the value: it's the pair of max value and position we
are talking about. Thrust returns the position, and I'm now also timing
the extraction of the value from the GPU array, which didn't change the
timing much.
ReductionKernel still appears 5 times slower than thrust.
Bryan, the same worksheet also prints the numpy timing: argmax is
3 times slower than ReductionKernel.
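
To make the "pair of max value and position" concrete, here is a minimal
CPU-side sketch in plain Python. It is not the PyCUDA or thrust code from
the worksheet, only an illustration of the reduction idea: each element
becomes a (value, index) pair and pairs are combined two at a time. The
names argmax_pair and combine are made up for this example, and the
tie-breaking rule (lowest index wins) is an assumption.

```python
from functools import reduce


def combine(a, b):
    """Combine two (value, index) pairs, keeping the larger value.

    Ties go to the smaller index (an assumption for this sketch), which
    makes the operator independent of the order in which pairs are combined.
    """
    if b[0] > a[0] or (b[0] == a[0] and b[1] < a[1]):
        return b
    return a


def argmax_pair(values):
    """Return (max_value, position) by reducing over (value, index) pairs."""
    if len(values) == 0:
        raise ValueError("argmax_pair() needs at least one element")
    pairs = ((v, i) for i, v in enumerate(values))
    return reduce(combine, pairs)


value, position = argmax_pair([3.0, 7.5, 7.5, -1.0])
```

Because combine does not depend on the order in which pairs meet, the same
operator could in principle be applied tree-wise, as a parallel reduction
does.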
On Thu, May 31, 2012 at 12:08 PM, Andreas Kloeckner
<lists(a)informa.tiker.net> wrote:
On Wed, 30 May 2012 22:13:27 +1200, Igor
<rychphd(a)gmail.com> wrote:
Hi Andreas,
I'm attaching an example for your wiki demonstrating how to find a max
element position both using ReductionKernel and thrust-nvcc-ctypes.
The latter doesn't quite work on Windows yet. It should work if you're
on Linux; just change the FOLDER. There is a live version published on
my Sage server (http://dev.math.canterbury.ac.nz/home/pub/26/).
Everything works there and shows a discouraging 5-fold slowdown of
ReductionKernel compared to thrust (run twice, as the .so file is
loaded lazily?). Could you take a look and edit it if necessary?
Not a fair comparison. The PyCUDA test includes the transfer of the
result to the host (.get()). That doesn't look like the case for
thrust. Also, an 80 MB vector is tiny. At 200 GB/s, that's about 4e-4 s,
which is in the vicinity of launch overhead.
Andreas
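
The bandwidth estimate above can be reproduced with a few lines of
Python. This is only a back-of-the-envelope sketch: it assumes the
reduction reads every byte once at the quoted 200 GB/s and ignores
launch overhead entirely. The function name min_read_time is made up,
and the 4-byte element size used for the billion-element case is an
assumption, since the thread does not state the dtype.

```python
def min_read_time(n_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to read n_bytes once at the given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


BANDWIDTH = 200e9  # 200 GB/s, the figure quoted in the message

# The 80 MB vector from the original benchmark.
t_small = min_read_time(80e6, BANDWIDTH)

# A billion elements, assuming 4 bytes per element (hypothetical dtype).
t_large = min_read_time(1_000_000_000 * 4, BANDWIDTH)

print(f"80 MB vector:     {t_small:.1e} s")
print(f"1e9 x 4 B vector: {t_large:.1e} s")
```

Under these assumptions the larger vector takes about 50 times longer to
read than the 80 MB one, so any fixed per-launch cost makes up a much
smaller share of the measured time.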