[PyCUDA] pyCUDA parallel scan performance
Bogdan Opanchuk
mantihor at gmail.com
Mon Sep 26 23:53:38 PDT 2011
Hello Алексей,
As far as I can see, there are two things you can try.
1. ElementwiseKernel.__call__ recalculates the necessary grid and block
sizes on every call, along with doing some other bookkeeping, which can
be significant when the kernel execution time is on the order of tens of
microseconds. You can write your own kernel and use
prepare()/prepared_call() to get rid of most of that overhead.
2. If you do not need to transfer intermediate results to the CPU,
enqueueing the scan calls in a stream instead of executing them
synchronously will effectively hide part of the overhead.
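
A rough sketch of both suggestions combined, for a simple elementwise
kernel rather than a full scan (a hand-written scan kernel would be
launched the same way). The kernel add_const, the helpers launch_config
and run_on_gpu, and the block size of 256 are made up for illustration.
The calls follow the newer PyCUDA interface, where the block size is
passed to prepared_call()/prepared_async_call(); older releases took it
in prepare() instead, so check this against your version.

```python
import numpy as np

BLOCK_SIZE = 256  # assumed threads per block; tune for the target GPU

# Hypothetical elementwise kernel used only to illustrate the launch path.
KERNEL_SOURCE = """
__global__ void add_const(int *data, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += value;
}
"""


def launch_config(n, block_size=BLOCK_SIZE):
    """Return (grid, block) tuples covering n elements, computed once."""
    blocks = (n + block_size - 1) // block_size
    return (blocks, 1), (block_size, 1, 1)


def run_on_gpu(n=2 ** 15, n_arrays=5):
    # PyCUDA is imported here so that launch_config() stays usable
    # on a machine without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as drv
    import pycuda.gpuarray as gpu
    from pycuda.compiler import SourceModule

    func = SourceModule(KERNEL_SOURCE).get_function("add_const")
    # Declare the argument types once: pointer, int32, int32.
    func.prepare("Pii")
    # Compute the launch configuration once instead of on every call.
    grid, block = launch_config(n)

    stream = drv.Stream()
    arrays = [gpu.zeros(n, dtype=np.int32) for _ in range(n_arrays)]
    for arr in arrays:
        # Asynchronous launch: returns without waiting for the kernel.
        func.prepared_async_call(grid, block, stream,
                                 arr.gpudata, np.int32(1), np.int32(n))
    stream.synchronize()  # wait once, after all launches are queued
    return [int(arr.get().sum()) for arr in arrays]
```

Calling run_on_gpu() on a machine with a CUDA device runs the five
launches back to back. Timing the loop together with the final
synchronize() gives the overhead per call that can be compared with the
numbers from the scan test.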
Best regards,
Bogdan
On Tue, Sep 27, 2011 at 4:28 PM, Алексей Гурин <guralisk at mail.ru> wrote:
> Hi,
> I'm currently working on a CFD code in PyCUDA and have a problem with
> parallel prefix sum. I need this function to copy the elements that are
> close to the free surface into a small linear array, but a call to the
> PyCUDA implementation of scan takes too much time.
>
> Here is the test code, which executes scan for 5 arrays:
>
> import pycuda.autoinit
> import pycuda.gpuarray as gpu
> import pycuda.scan as scan
> import time
> from pycuda.compiler import SourceModule
> import numpy as np
>
> N = pow(2,15)
> NArrays = 5
>
> arrayH = np.zeros(N,dtype=np.int32)
>
> arrayDList = []
> for i in range(NArrays):
>     arrayDList.append(gpu.to_gpu(arrayH))
>
> krn = scan.InclusiveScanKernel(np.int32,"a+b")
>
> for i in range(NArrays):
>     # wall-clock time of the Python-side call
>     time1 = time.time()
>     krn(arrayDList[i])
>     time2 = time.time()
>     print("time = " + str(time2-time1))
>
> Output:
>
> time = 0.000386953353882
> time = 0.000221967697144
> time = 0.000216960906982
> time = 0.00021505355835
> time = 0.000216007232666
>
> CUDA Profiler output:
> ...
> method=[ scan_scan_intervals ] gputime=[ 16.640 ] cputime=[ 16.000 ] occupancy=[ 0.500 ]
> method=[ scan_scan_intervals ] gputime=[ 9.920 ] cputime=[ 5.000 ] occupancy=[ 0.125 ]
> method=[ scan_final_update ] gputime=[ 5.408 ] cputime=[ 4.000 ] occupancy=[ 1.000 ]
> ...
>
> On the GPU the scan takes about 30 microseconds, but the call from Python
> code takes about 200. I need to call the scan on every timestep in my
> code, and 200 μs is too slow (the energy equation solver takes about
> 150 μs). Is there any way to reduce the call time of the parallel scan?