Hi Andreas,
I know it's subject to rounding etc., but ReductionKernel often finds a
max value at a different position than both numpy.argmax and
thrust::max_element, while those two always agree with each other.
(Check out the wildly different answer from numpy.sum.)
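For concreteness, here is a minimal sketch of the comparison I mean (the
kernel on the worksheet differs in detail; reducing over indices, with the
data array referenced inside reduce_expr, is just one way to express an
argmax with ReductionKernel):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Reduce over indices i; keep whichever index points at the larger
    # value. neutral="0" just seeds the reduction with index 0, which is
    # always a valid candidate.
    argmax_krnl = ReductionKernel(
        np.int32, neutral="0",
        reduce_expr="(x[a] >= x[b]) ? a : b",
        map_expr="i",
        arguments="const float *x")

    host = np.random.rand(1 << 20).astype(np.float32)
    dev = gpuarray.to_gpu(host)

    gpu_idx = int(argmax_krnl(dev).get())
    cpu_idx = int(np.argmax(host))
    print("gpu %d  cpu %d  match: %s" % (gpu_idx, cpu_idx, gpu_idx == cpu_idx))

When the maximum is unique both indices should agree; on the worksheet's
data they often don't.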
Igor
On Thu, May 31, 2012 at 6:59 PM, Andreas Kloeckner
<lists(a)informa.tiker.net> wrote:
On Thu, 31 May 2012 12:56:15 +1200, Igor
<rychphd(a)gmail.com> wrote:
I've updated the worksheet at http://dev.math.canterbury.ac.nz/home/pub/26/
with a larger vector, a billion elements.
As for returning the value: it's the pair of max value and position we
are talking about. thrust returns the position, and I'm now also timing
the extraction of the value from the gpu array, which didn't change the
timing much. ReductionKernel still appears 5 times slower than thrust.
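Roughly what the timed extraction looks like (assuming an argmax_krnl
ReductionKernel along the lines of the sketch above; the one-element slice
is just one way to pull a single value off the GPU):

    import pycuda.driver as drv

    start, stop = drv.Event(), drv.Event()
    start.record()
    idx = int(argmax_krnl(dev).get())     # position: reduction + index copy to host
    val = float(dev[idx:idx+1].get()[0])  # value: one-element copy from the gpu array
    stop.record()
    stop.synchronize()
    print("max %g at %d, %.3f ms" % (val, idx, start.time_till(stop)))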
Bryan, the numpy timing is printed on the same worksheet as well:
numpy.argmax is 3 times slower than ReductionKernel.
I've looked at this for a little while but can't quite make heads or
tails of it yet. Here's the profiler output:
method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[ 20647.000 ]
    gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ]
    l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]

method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ]
    gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ]
    l1_shared_bank_conflict=[ 96 ]

method=[ _ZN6thrust<snip> ] gputime=[ 3556.736 ] cputime=[ 3583.000 ]
    gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ]
    l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]

method=[ _ZN6thrust6<snip> ] gputime=[ 8.640 ] cputime=[ 30.000 ]
    gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ]
    l1_shared_bank_conflict=[ 18 ]
Second stages are comparable, but PyCUDA receives a sound beating in the
first stage. I don't quite understand why though. Code-wise, PyCUDA and
thrust do mostly the same thing--some parameters are different, but I've
twiddled them, and they don't make a big difference. From the profile,
the main killer seems to be that thrust's code simply issues three times
fewer instructions. But I don't get why--the codes aren't that
different.
Compare for yourself:
https://code.google.com/p/thrust/source/browse/thrust/system/detail/generic…
https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/re…
https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/bl…
vs
https://github.com/inducer/pycuda/blob/b28595eb92345f561096e833062f11b89601…
I've also made a version of the reduction that's even more directly like
what thrust does:
https://github.com/inducer/pycuda/blob/thrusty-reduce/pycuda/reduction.py
The timing is about the same, maybe even a tad slower. I'd much
appreciate any clues. Igor, can you please check whether the perf
difference is the same on just a simple sum'o'floats?
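Something along these lines would do; a minimal sketch (the sum kernel
follows the standard ReductionKernel pattern, and the array size and
CUDA-event timing are just for illustration):

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Plain sum of floats: no map stage beyond reading x[i].
    sum_krnl = ReductionKernel(
        np.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]",
        arguments="const float *x")

    dev = gpuarray.to_gpu(np.random.rand(1 << 24).astype(np.float32))

    start, stop = drv.Event(), drv.Event()
    start.record()
    s = float(sum_krnl(dev).get())
    stop.record()
    stop.synchronize()
    print("sum = %g, %.3f ms" % (s, start.time_till(stop)))

Timing that against thrust::reduce on the same data would tell us whether
the gap is in the reduction machinery itself or in the argmax comparator.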
Andreas