[PyCUDA] pyCUDA parallel scan performance

Алексей Гурин guralisk at mail.ru
Mon Sep 26 23:28:46 PDT 2011


Hi,
I'm currently working on a CFD code in PyCUDA and have a problem with the parallel prefix sum (scan). I need it to copy the elements that are close to the free surface into a small linear array, but calling the PyCUDA implementation of scan takes too much time.
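
To make the use case concrete, here is a minimal CPU sketch in NumPy of the compaction step that the scan feeds. It is only an illustration: the function name `compact_indices` and the flag values are hypothetical, and the real code would do the scan and the scatter on the GPU.

```python
import numpy as np

def compact_indices(flags):
    """Return the indices of flagged cells, packed into a small array.

    Hypothetical helper: an exclusive prefix sum over 0/1 flags gives
    each flagged cell its slot in the compacted output.
    """
    flags = (np.asarray(flags) != 0).astype(np.int32)  # normalize to 0/1
    inclusive = np.cumsum(flags, dtype=np.int32)       # inclusive scan ("a+b")
    exclusive = inclusive - flags                      # output slot per cell
    count = int(inclusive[-1]) if flags.size else 0    # number of flagged cells
    out = np.empty(count, dtype=np.int32)
    selected = flags != 0
    # Scatter: each flagged cell writes its own index into its slot.
    out[exclusive[selected]] = np.nonzero(selected)[0]
    return out

# Made-up flags: cells 1, 4 and 5 are marked as close to the free surface.
flags = np.array([0, 1, 0, 0, 1, 1, 0], dtype=np.int32)
surface_cells = compact_indices(flags)
```

The last element of the inclusive scan is the number of selected cells, which is why the scan has to run on every timestep before the small array can be filled.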

Here is the test code, which executes scan for 5 arrays:

import pycuda.autoinit
import pycuda.gpuarray as gpu
import pycuda.scan as scan
import time
from pycuda.compiler import SourceModule
import numpy as np

N = pow(2,15)
NArrays = 5

arrayH = np.zeros(N,dtype=np.int32)

arrayDList = []
for i in range(NArrays):
    arrayDList.append(gpu.to_gpu(arrayH))

krn = scan.InclusiveScanKernel(np.int32,"a+b")

for i in range(NArrays):
    time1 = time.time()
    krn(arrayDList[i])
    time2 = time.time()
    print "time = " + str(time2-time1)

Output:

time = 0.000386953353882
time = 0.000221967697144
time = 0.000216960906982
time = 0.00021505355835
time = 0.000216007232666

CUDA Profiler output:
...
method=[ scan_scan_intervals ] gputime=[ 16.640 ] cputime=[ 16.000 ] occupancy=[ 0.500 ]
method=[ scan_scan_intervals ] gputime=[ 9.920 ] cputime=[ 5.000 ] occupancy=[ 0.125 ] 
method=[ scan_final_update ] gputime=[ 5.408 ] cputime=[ 4.000 ] occupancy=[ 1.000 ] 
...

On the GPU the scan takes about 30 microseconds, but the call from Python takes about 200. I need to call the scan on every timestep of my code, and 200 μs is too slow (the energy equation solver takes about 150 μs). Is there any way to reduce the call time of the parallel scan?
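
Since a single `time.time()` pair around one call is noisy, the per-call host cost could also be averaged over many calls. The following is a sketch only: `mean_call_time` is a hypothetical helper, and the workload here is a stand-in so that it runs without a GPU. With PyCUDA, `fn` would be something like `lambda: krn(arrayDList[0])` and `sync` could be `pycuda.driver.Context.synchronize`, assuming the launches are asynchronous.

```python
from timeit import default_timer

def mean_call_time(fn, repeats=1000, sync=None):
    """Average wall-clock seconds per call of fn over `repeats` calls.

    Hypothetical helper. `sync`, if given, is called before the clock
    starts and once after the last call, so that work queued
    asynchronously is included in the total.
    """
    fn()                       # one warm-up call, excluded from the timing
    if sync is not None:
        sync()                 # drain pending work before starting the clock
    start = default_timer()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()                 # wait for all queued work to finish
    return (default_timer() - start) / repeats

# Stand-in workload so the sketch runs without a GPU.
calls = []
per_call = mean_call_time(lambda: calls.append(1), repeats=100)
```

The warm-up call is there because the first call in the output above is noticeably slower than the rest.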