The worksheet now has the timing
measured with Python's time.time(); there isn't much difference. The
card is a Tesla C2070.
On Thu, May 31, 2012 at 3:31 PM, Bryan Catanzaro <bcatanzaro(a)acm.org> wrote:
Hi Igor -
I meant that it's more useful to know the execution time of code
running on the GPU from Python's perspective, since Python is the one
driving the work, and the execution overheads can be significant.
What timings do you get when you use timeit rather than CUDA events?
Also, what GPU are you running on?
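For reference, a minimal sketch of the timeit-style measurement Bryan suggests, using a numpy argmax as a stand-in since no GPU is attached here (for the PyCUDA case you would wrap the kernel invocation plus the .get() in the timed callable):

```python
import timeit

import numpy as np

a = np.random.rand(10_000_000).astype(np.float32)

# timeit measures the whole round trip from Python's perspective,
# including call and launch overheads that on-device CUDA events miss.
t = timeit.timeit(lambda: np.argmax(a), number=10) / 10
print(f"argmax: {t * 1e3:.3f} ms per call")
```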
On Wed, May 30, 2012 at 5:56 PM, Igor <rychphd(a)gmail.com> wrote:
> I've updated http://dev.math.canterbury.ac.nz/home/pub/26/ with a
> larger vector, a billion elements.
> As for returning the value: it's the pair of max value and position we
> are talking about. thrust returns the position, and I'm now also timing
> the extraction of the value from the GPU array, which didn't change the
> timing much.
> ReductionKernel still appears 5 times slower than thrust.
> Bryan, on the same worksheet the numpy timing is printed as well:
> argmax is 3 times slower than ReductionKernel.
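A minimal numpy sketch of the (value, position) pair discussed above: argmax gives the position, and one indexing operation recovers the value.

```python
import numpy as np

x = np.random.rand(1_000_000).astype(np.float32)

# argmax returns the position of the maximum; a single indexing
# operation then recovers the value itself.
pos = int(np.argmax(x))
val = x[pos]
print(pos, val)
```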
> On Thu, May 31, 2012 at 12:08 PM, Andreas Kloeckner
> <lists(a)informa.tiker.net> wrote:
>> On Wed, 30 May 2012 22:13:27 +1200, Igor <rychphd(a)gmail.com> wrote:
>>> Hi Andreas,
>>> I'm attaching an example for your wiki demonstrating how to find a max
>>> element position both using ReductionKernel and thrust-nvcc-ctypes.
>>> The latter doesn't quite work on Windows yet. It should work if
>>> you're on Linux; just change the FOLDER. There is a live version
>>> published on my Sage server
>>> (http://dev.math.canterbury.ac.nz/home/pub/26/), where both versions
>>> work and show a discouraging 5-fold slowdown of ReductionKernel as
>>> compared to thrust (run it twice, as the .so file is loaded lazily?).
>>> Could you take a look and edit it if necessary?
>> Not a fair comparison. The PyCUDA test includes the transfer of the
>> result to the host. (.get()) Doesn't look like that's the case for
>> thrust. Also, an 80 MB vector is tiny. At 200 GB/s, that's about 4e-4s,
>> which is in the vicinity of launch overhead.
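The back-of-envelope number above checks out:

```python
# Back-of-envelope check: streaming an 80 MB vector once at a
# peak memory bandwidth of 200 GB/s.
size_bytes = 80e6
bandwidth = 200e9
t = size_bytes / bandwidth
print(t)  # 0.0004 s, i.e. 4e-4 s, in the vicinity of launch overhead
```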
> PyCUDA mailing list