Thank you marmaduke! For points 1 and 2, that's what I've noticed too.
Date: Fri, 6 Jul 2012 13:42:32 +0200
Subject: Re: [PyCUDA] Do tasks run in background?
You may want to hold out for a more authoritative response from someone else, but I have
noticed, and write my code assuming, that:
- func() will launch the kernel and return (almost) immediately
- attempts to access gpuarrays involved in a launched kernel will block until the launched
kernel has completed
- pycuda.driver.Context.synchronize can be called to explicitly wait for a launched kernel
to complete (which is useful if you have two kernels operating on the same data, as they
could otherwise run simultaneously)
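A minimal sketch of that last point, assuming a compiled kernel `func` and its GPU
arguments (the helper name `launch_and_wait` is my own, not from the code below): the
launch call only queues work, and the explicit synchronize makes the host wait for it.

```python
try:
    import pycuda.driver as cuda
except ImportError:
    cuda = None  # pycuda not installed; sketch only

def launch_and_wait(func, gpu_args, grid, block):
    """Launch 'func' (a hypothetical compiled kernel) and block the host
    until the kernel has actually finished, not just been queued."""
    func(*gpu_args, grid=grid, block=block)  # returns almost immediately
    cuda.Context.synchronize()               # explicit wait for completion
```

Anything launched before the synchronize is guaranteed finished when it returns, so a
second kernel launched afterwards sees the first kernel's results.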
On Fri, Jul 6, 2012 at 11:39 AM, Orestis K <orekost(a)hotmail.com> wrote:
I'm new to PyCUDA and GPU programming, but my initial experiences have been very
pleasant. I started out with a simple task and it seems blazingly fast compared to
running on a CPU. However, I would like to confirm that it's indeed as fast as it seems.
My main question is: after 'func' is called and control of the prompt is regained,
are any of the tasks still running on the GPU? If so, is there a way to block before
performing the next task, until the GPU has finished?
I've posted the code below for reference purposes. You can decrease the value of N so
that it runs faster; I set it very close to the limit so that I might witness a delay
in regaining control of the command prompt.
Thank you in advance and please keep up the excellent work!
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import sys, numpy, random, string

# create random input data
N = 33500000
buf = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase +
    string.digits) for x in xrange(N))

# pack each run of 4 consecutive chars into one 32-bit word
mod = SourceModule("""
__global__ void get_words(int N, char *a, unsigned int *b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N-3)
        b[idx] = (a[idx] << 24) + (a[idx+1] << 16) + (a[idx+2] << 8) + a[idx+3];
}
""")
func = mod.get_function("get_words")

# copy buffer to GPU
bufArray = cuda.mem_alloc(N)
cuda.memcpy_htod(bufArray, buf)

# create results array on GPU
resArray = gpuarray.to_gpu(numpy.zeros((N-3, 1), dtype=numpy.uint32))

# setup parameters and execute function
threadsPerBlock = 512
blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock
func(numpy.int32(len(buf)), bufArray, resArray,
     grid=(blocksPerGrid, 1), block=(threadsPerBlock, 1, 1))

# get back results
a = numpy.zeros((N-3, 1), dtype=numpy.uint32)
resArray.get(a)
a = a.reshape(-1).tolist()
PyCUDA mailing list