On Thu, 31 May 2012 12:56:15 +1200, Igor <rychphd(a)gmail.com> wrote:
I've updated the
larger vector, a billion elements.
As for returning the value: it's the pair of max value and position we
are talking about. The thrust version returns the position, and I'm now
timing the extraction of the value from the GPU array as well, which
didn't change the timing.
ReductionKernel still appears 5 times slower than thrust.
Bryan, on the same worksheet the numpy timing is printed as well:
argmax is 3 times slower than ReductionKernel.
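
For readers following along, here is a minimal CPU-only sketch of what a reduction over (max value, position) pairs looks like when it is split into two stages, loosely mirroring the stage1/stage2 kernel split. This is not the PyCUDA or thrust code: the names combine and two_stage_argmax, the num_blocks parameter, the contiguous chunking, and the tie-breaking rule (lowest position wins) are all assumptions made for illustration.

```python
from functools import reduce


def combine(a, b):
    """Associative combine for (value, position) pairs.

    Keeps the pair with the larger value; on equal values the lower
    position wins (an assumed tie-breaking rule).
    """
    if b[0] > a[0] or (b[0] == a[0] and b[1] < a[1]):
        return b
    return a


def two_stage_argmax(data, num_blocks=4):
    """Return (max value, position) using a two-stage reduction."""
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    pairs = [(value, pos) for pos, value in enumerate(data)]
    # Stage 1: each "block" reduces its own contiguous chunk to one pair.
    block_len = -(-len(pairs) // num_blocks)  # ceiling division
    partials = [
        reduce(combine, pairs[start:start + block_len])
        for start in range(0, len(pairs), block_len)
    ]
    # Stage 2: a single pass reduces the per-block partial results.
    return reduce(combine, partials)


if __name__ == "__main__":
    sample = [3.0, 7.5, -1.0, 7.5, 2.0, 0.5, 6.0]
    print(two_stage_argmax(sample))
```

Because combine is associative, the result does not depend on how the input is split into blocks, which is what lets the first stage run in parallel.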
I've looked at this for a little while, but can't quite make heads or
tails of it yet. Here's the profiler output:
method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[ 20647.000 ]
    gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ]
    l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]
method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ]
    gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ]
    l1_shared_bank_conflict=[ 96 ]
method=[ _ZN6thrust<snip>] gputime=[ 3556.736 ] cputime=[ 3583.000 ]
    gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ]
    l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]
method=[ _ZN6thrust6<snip>] gputime=[ 8.640 ] cputime=[ 30.000 ]
    gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ]
    l1_shared_bank_conflict=[ 18 ]
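
As a quick sanity check on the first-stage numbers in the profiler output above, the ratios can be recomputed directly. The sketch below only copies the gputime and inst_issued values from the log; the dictionary layout and the ratio helper are hypothetical names chosen for illustration. It gives roughly 5.8 for the time ratio and roughly 2.7 for the instruction ratio, so the instruction count on its own would not account for the whole gap.

```python
# First-stage figures copied verbatim from the profiler output.
pycuda_stage1 = {"gputime": 20617.984, "inst_issued": 7906011}
thrust_stage1 = {"gputime": 3556.736, "inst_issued": 2964333}


def ratio(field):
    """PyCUDA-to-thrust ratio for one profiler counter."""
    return pycuda_stage1[field] / thrust_stage1[field]


time_ratio = ratio("gputime")
inst_ratio = ratio("inst_issued")
# Part of the slowdown that the instruction count does not explain.
residual = time_ratio / inst_ratio

if __name__ == "__main__":
    print(f"gputime ratio:     {time_ratio:.2f}")
    print(f"inst_issued ratio: {inst_ratio:.2f}")
    print(f"residual factor:   {residual:.2f}")
```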
Second stages are comparable, but PyCUDA receives a sound beating in the
first stage. I don't quite understand why though. Code-wise, PyCUDA and
thrust do mostly the same thing--some parameters are different, but I've
twiddled them, and they don't make a big difference. From the profile,
the main killer seems to be that thrust's code simply issues three times
fewer instructions. But I don't get why--the codes aren't that
I've even made a version of reduction that's even more directly like
what thrust does:
The timing is about the same, even a tad bit slower. I'd much appreciate
any clues. Igor, can you please check if the perf difference is the same
on just a simple sum'o'floats?